Checklist for Artificial Intelligence in Medical Imaging Reporting Adherence in Peer-Reviewed and Preprint Manuscripts With the Highest Altmetric Attention Scores: A Meta-Research Study

Abstract

French

Purpose: To establish reporting adherence to the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) in diagnostic accuracy AI studies with the highest Altmetric Attention Scores (AAS), and to compare completeness of reporting between peer-reviewed manuscripts and preprints. Methods: MEDLINE, EMBASE, arXiv, bioRxiv, and medRxiv were retrospectively searched for 100 diagnostic accuracy medical imaging AI studies in peer-reviewed journals and preprint platforms with the highest AAS since the release of CLAIM to June 24, 2021. Studies were evaluated for adherence to the 42-item CLAIM checklist with comparison between peer-reviewed manuscripts and preprints. The impact of additional factors was explored including body region, models on COVID-19 diagnosis and journal impact factor. Results: Median CLAIM adherence was 48% (20/42). The median CLAIM score of manuscripts published in peer-reviewed journals was higher than preprints, 57% (24/42) vs 40% (16/42), P < .0001. Chest radiology was the body region with the least complete reporting (P = .0352), with manuscripts on COVID-19 less complete than others (43% vs 54%, P = .0002). For studies published in peer-reviewed journals with an impact factor, the CLAIM score correlated with impact factor, rho = 0.43, P = .0040. Completeness of reporting based on CLAIM score had a positive correlation with a study’s AAS, rho = 0.68, P < .0001. Conclusions: Overall reporting adherence to CLAIM is low in imaging diagnostic accuracy AI studies with the highest AAS, with preprints reporting fewer study details than peer-reviewed manuscripts. Improved CLAIM adherence could promote adoption of AI into clinical practice and facilitate investigators building upon prior works.

Keywords

artificial intelligence evidence-based medicine guideline methods radiology research design

Introduction

Research into the application of artificial intelligence (AI) to radiology is becoming increasingly popular. The number of publications on AI in medical imaging increased from about 100-150 per year in 2007-2008 to 700-800 per year in 2016-2017.¹ In 2018-2019, about 25% of published articles in Radiology were related to radiomics or AI.² A significant proportion of these publications assessed diagnostic accuracy; a systematic review by Kelly et al in April 2022 found that over 30% of radiology AI studies applied AI to classify lesions using medical imaging.³ The completeness of reporting of diagnostic accuracy studies has been shown to vary. This can make it difficult for readers to assess the generalizability and sources of bias within a study and which may be a barrier to the acceptance of a study’s findings in clinical practice or as a basis for future research.^4,5

In March 2020, the Checklist for AI in Medical Imaging (CLAIM) guide was published by Mongan, Moy, and Kahn Jr and endorsed by Radiology: Artificial Intelligence.⁶ CLAIM is a 42-item checklist that builds upon the Standards for Reporting of Diagnostic Accuracy Studies (STARD) 2015 statement, designed to evaluate reporting completeness specific to applications of AI in medical imaging.⁷ Since its release, the adherence to CLAIM has been unclear. Clarifying the uptake of CLAIM and identifying items that are infrequently reported would provide guidance to publishing platforms and investigators by identifying areas for improved reporting in future works. Understanding how reporting completeness compares between manuscripts in peer-reviewed journals and on non-peer-reviewed preprint platforms, where many diagnostic imaging AI studies are published, would alert readers to differences between these and encourage them to apply an appropriate degree of scrutiny based on the publication source. This could facilitate the acceptance and adoption of some AI models over others in practice.

The purpose of this study was to establish reporting adherence to the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) in diagnostic accuracy AI studies with the highest AAS scores, and to compare completeness of reporting between peer-reviewed manuscripts and preprints.

Methods

Institutional Research Board approval was not sought as all data are available in the public domain. Reporting was per the relevant components of the PRISMA-DTA guideline.⁸ The study protocol was registered (Open Science Framework: osf.io/3djma).

Data Sources

A preliminary search strategy was developed with the assistance of a librarian to search MEDLINE, EMBASE, arXiv, bioRxiv, and medRxiv to retrospectively identify diagnostic accuracy AI studies published after the release of the CLAIM checklist.

Study Selection

Given the large number of medical imaging AI studies published in the past few years, we elected to limit the search to the 100 diagnostic accuracy AI studies with the highest Altmetric Attention Score (AAS) including 50 from peer-reviewed journals and 50 from preprint platforms. The AAS is a weighted scoring system that captures the impact of a study based on more traditional sources such as citations in indexed journals but also incorporates attention on social media platforms including Twitter, public documents, online references and a variety of other sources applicable to both print and preprint articles.⁹ The Altmetric Application Programming Interface (API) was queried on June 25, 2021 to retrieve the AAS to establish the impact of each study between March 25, 2020, and June 24, 2021.

Studies were included if they met the following criteria: (1) apply AI to diagnostic imaging of humans (excluding applications to histopathology images, animals); (2) English language; (3) main outcome was to determine a measure of diagnostic accuracy; (4) published after the online release of CLAIM (March 25, 2020) and before the search date of June 25, 2021; (5) AAS available via the Altmetric API. Duplicate studies, studies lacking a digital object identifier (DOI) or equivalent to query the Altmetric API, and studies that had no AAS calculated were excluded. Posters, conference presentations, and reviews were excluded as CLAIM is focused towards reporting in research manuscripts. Preprint articles were excluded if the same study was later published in a peer-reviewed journal (Figure 1).

Figure 1.

Flow diagram for included studies.

After the studies were retrieved, two authors (US, KW) independently screened the titles and abstracts in order of descending AAS according to the criteria above to identify the 50 print and 50 preprint articles with the highest AAS for inclusion. Disagreements were discussed, and in any case where consensus was not achieved, the decision was made by a more experienced third reviewer (CBvdP).

Data Extraction

Data was independently extracted by two reviewers (US, KW) using a data extraction sheet, with disagreements subsequently discussed and referred to a third reviewer if consensus could not be achieved (CBvdP). The following data were extracted for each study: title, first author, year of publication, country of corresponding author’s institution, whether the paper referenced CLAIM, imaging modality, subspecialty, body region, if the paper was related to COVID-19, if the paper involved an FDA-approved product, if the source code was available to the public, and adherence to each of the 42-items from the CLAIM guide. COVID-19 was included as a variable of interest as it was a common focus for diagnostic imaging AI research during the study period. For articles in peer-reviewed journals, the following data were also extracted: journal of publication, journal impact factor, and if the journal had formally adopted the CLAIM checklist.

Data Analysis

A CLAIM score based on the sum of the completed items from the CLAIM checklist was calculated for each article. Some CLAIM checklist items were not applicable to specific study designs, which were documented as not applicable (for example, item 27 recommends explaining any ensembles of models and is only applicable for studies that used ensembles). CLAIM score data normality were assessed using the Shapiro–Wilk test with means compared using the two-sample t test and medians compared using the Wilcoxon rank-sum test including between peer-reviewed manuscripts vs those on preprint platforms, studies that made their source code available vs those that did not, studies in Radiology: Artificial Intelligence vs all others, and studies on COVID-19 vs all others. The Fisher exact test and chi-square test were used to compare the proportion of studies that fulfilled CLAIM criteria for the CLAIM section/topics. The Kruskal–Wallis test was used to compare CLAIM scores for studies on different body regions. AAS was compared between prints and preprints using the Wilcoxon rank-sum test. Spearmen correlation was used to compare the CLAIM score with journal impact factor and AAS. P < .05 defined statistical significance. All analysis was performed using R (A language and environment for statistical computing, v4.1.3, Vienna, Austria).¹⁰

Results

Study Search

A total of 5143 results were identified after screening, of which 3434 were print and 1709 were preprint. Print articles were assessed in order of descending AAS until 50 articles had been identified that met the eligibility criteria. To reach this threshold, 184 total print articles were examined, of which 134 were excluded for not meeting eligibility criteria. Similarly, 661 preprint articles were examined, of which 611 were excluded. This yielded the included 100 diagnostic accuracy AI studies. For print articles, 19 were from the USA, 9 China, and the remainder were from 13 other countries including 9 published in Radiology, 5 in Radiology: Artificial Intelligence, and the remainder in 26 other journals. For preprints, 12 were from the USA, 10 China, 7 India, and the remainder were from 13 other countries with 39 on arXiv, 9 on medRxiv, and 2 on bioRxiv.

Overall CLAIM Adherence

Median CLAIM adherence was 48% (20/42). Several CLAIM items were reported more often than not, including metrics of model performance (99/100), the study goal (98/100 studies), and performance metrics for optimal model(s) on all data partitions (94/100). Others were most often not reported, including de-identification methods (6/100), intended sample size and how it was determined (4/100), and where the full study protocol can be accessed (3/100). For the 44 studies published in peer-reviewed journals with an impact factor, the CLAIM score correlated with journal impact factor, rho = 0.43, P = .0040. The CLAIM score also correlated with AAS, rho = 0.68, P < .0001.

Peer-Reviewed Manuscripts vs Preprints

The median CLAIM score of manuscripts published in peer-reviewed journals was higher than for those on preprint platforms, 57% (24/42) vs 40% (16/42), P < .0001, Table 1 and Figure 2. Peer-reviewed manuscripts had higher AAS scores than preprints, median 58 vs 5, P < .0001 (Figure 3).

Table 1.

Checklist for Artificial Intelligence in Medical Imaging Completeness Based on Section/Topic Comparing Articles in Peer-Reviewed Journals and on Preprint Platforms.

Section/Topic	No.	Peer-Reviewed Journal Articles, No. (%)	Preprint Articles, No. (%)	P Value
Section/Topic	No.	(N = 50)	(N = 50)	P Value
Title or abstract
	1	43 (86%)	44 (88%)	1
Abstract
	2	48 (96%)	28 (56%)	<.0001
Introduction
	3	50 (100%)	44 (88%)	.0267
	4	48 (96%)	43 (86%)	.1595
Methods
Study design	5	41 (82%)	10 (20%)	<.0001
Study design	6	50 (100%)	48 (96%)	.4949
Data	7	41 (82%)	46 (92%)	.2336
	8	45 (90%)	9 (18%)	<.0001
	9	34 (68%)	37 (74%)	.6594
	10	10 (20%)	14 (28%)	.0175
	11	35 (70%)	21 (42%)	.0088
	12	4 (8%)	2 (4%)	.6777
	13	10 (20%)	3 (6%)	.0713
Ground truth	14	39 (78%)	12 (24%)	<.0001
	15	17 (34%)	2 (4%)	.0002
	16	32 (64%)	11 (22%)	<.0001
	17	14 (28%)	2 (4%)	.0019
	18	22 (44%)	3 (6%)	<.0001
Data partitions	19	3 (6%)	1 (2%)	.6173
	20	31 (62%)	40 (80%)	.0779
	21	33 (66%)	45 (90%)	.0070
Model	22	32 (64%)	33 (66%)	1
	23	18 (36%)	32 (64%)	.0093
	24	19 (38%)	29 (58%)	.0716
Training	25	36 (72%)	43 (86%)	.1407
	26	13 (26%)	7 (14%)	.2113
	27	7 (14%)	6 (12%)	.5528
Evaluation	28	49 (98%)	50 (100%)	1
	29	33 (66%)	15 (30%)	.0007
	30	7 (14%)	1 (2%)	.0595
	31	12 (24%)	13 (26%)	1
	32	18 (36%)	6 (12%)	.0100
Results
Data	33	24 (48%)	2 (4%)	<.0001
Data	34	30 (60%)	7 (14%)	<.0001
Model performance	35	47 (94%)	47 (94%)	1
	36	32 (64%)	15 (30%)	.0013
	37	11 (22%)	23 (46%)	.0202
Discussion
	38	48 (96%)	13 (26%)	<.0001
	39	48 (96%)	24 (48%)	<.0001
Other information
	40	6 (12%)	1 (2%)	.1117
	41	3 (6%)	0 (0%)	.2424
	42	42 (84%)	19 (38%)	<.0001

Figure 2.

Number of print and preprint studies reporting each CLAIM item.

Figure 3.

CLAIM adherence based on Altmetric Attention Score. Lines of best fit generated using a linear model.

Additional Factors

Of the 100 articles included in this study, studies that made their code available to the public had higher median CLAIM scores than those that did not, 60% (25/42) vs 43% (18/42), P = .0006. CLAIM scores differed depending on the body region studied with studies reporting on chest pathologies having the lowest CLAIM scores, P = .0352 (Table 2, Figure 4). A total of 52 studies were related to COVID-19, including 15 manuscripts in peer-reviewed journals and 37 on preprint platforms. For studies on COVID-19 vs others, the median CLAIM score was 43% (18/42) vs 54% (22.5/42), P = .0002. None of the 100 studies included an FDA-approved machine learning product.

Table 2.

Checklist for Artificial Intelligence in Medical Imaging Scores Based on Body Region.

Body Region	Number of Studies	Median CLAIM Score
Thoracic	65	43% (18/42)
Neuro, head, and neck	12	51% (21.5/42)
Breast	11	60% (25/42)
Abdomen pelvis	4	49% (20.5/42)
Cardiac	4	56% (23.5/42)
Musculoskeletal	4	50% (21/42)

CLAIM scores were found to significantly differ depending on the body region assessed, P = .0352.

Figure 4.

Details of 100 selected studies. (A) Imaging modality of studies (B) Radiology subspecialty of studies.

Radiology: Artificial Intelligence was the only journal to have endorsed the CLAIM guideline. The mean CLAIM score for studies published in Radiology: Artificial Intelligence was nearly but not significantly greater than those published elsewhere, 55% (23/42) vs 48% (20/42), P = .0661.

Discussion

Since the online release of the CLAIM guide, we found that the 100 peer-reviewed and preprint diagnostic accuracy medical imaging AI studies with the highest AAS had overall low reporting adherence to CLAIM at 48% of items, with preprints having less complete reporting than peer-reviewed manuscripts. This highlights the need for improved adherence to the CLAIM guide in all manuscript types, and especially in pre-prints which lack peer-review. These findings suggest that readers should exercise a higher degree of caution when reviewing articles on preprint platforms recognizing that important methodology details may be insufficiently reported. Suboptimal reporting of diagnostic imaging AI research methodology may be a barrier to embracing AI into clinical practice and may limit the ability of investigators to build upon prior works.

Preprint manuscripts were more deficient at reporting the scientific and clinical background, study objectives and goals, study limitations, including potential bias, and implications for practice, whereas print manuscripts tended to provide less detail regarding technical parameters. These differences may be attributable to the lack of peer-review for preprints. It is also possible that peer-reviewed journals attract authors who are more clinically oriented and more familiar with the reporting of non-AI medical research in traditional journals, whereas preprint platforms may attract authors from computer science who have greater experience reporting technical details pertaining to AI model development. Future research on preprint manuscripts that went on to publish in peer-reviewed journals could provide insight into the effect of the editorial practice.

There have been prior studies on reporting quality in AI studies that have also found poor aggregate reporting completeness.^11,12 O’Shea et al. found comparable results to the current study with completeness percentages of 91% for identifying AI methodology and 69% for preprocessing data.¹¹ These items are mostly inherent in the training of AI and are usually necessary to reproduce the results. This may help to explain why these items are reported more frequently than others that are not necessary to train AI and not immediately necessary for replication. One such example is the robustness or sensitivity analysis (8/100), which often requires significant additional analysis of the model without guaranteeing any commensurate performance benefit.¹² Another example is sample size determination (4/100), especially since the amount of available medical imaging data is usually limited, so researchers simply use as much as possible rather than perform a power analysis.

During the early stages of the COVID-19 pandemic in 2020, there was a significant increase in the number of preprint manuscripts, and these manuscripts were published more rapidly than preprint manuscripts prior to the pandemic.¹³ Many studies explored the application of AI to diagnose COVID-19 on chest radiographs or CT exams, and despite the remarkable efforts of researchers in this field, our results suggest that these studies were deficient in reporting completeness. This is consistent with a review by Roberts et al and another by Wynants et al which similarly found severe methodological flaws, poor reporting, and high risk of bias in these studies.^14,15

Limitations

This study has several limitations. Reviewing manuscripts in the order of highest to lowest AAS may have introduced anchoring bias wherein the higher scoring articles served as the reference point for the lower scoring articles. Although using the AAS allowed us to interpret the reporting completeness of impactful studies, there is no guarantee that these findings are generalizable to other studies. While other measures of study impact could have been used, there is no ideal technique to establish a study’s true impact. For example, article citation counts can be biased by self-citations and would likely not have had sufficient time to accumulate to permit a meaningful comparison in the current study.^16,17 Finally, the CLAIM checklist provides little guidance on how to holistically interpret a manuscript in conjunction with its code, which is frequently released publicly as open source software along with the manuscript. If the code is provided with the article, it provides other researchers with a wealth of information about data preprocessing, as well as the structure and training of the neural network that may be self-explanatory and reproducible but not be captured in CLAIM.

Conclusion

Overall reporting adherence to CLAIM is low in medical imaging diagnostic accuracy AI studies with a high AAS, with preprint manuscripts reporting fewer study details than peer-reviewed manuscripts. There is much room for improved adherence to the CLAIM guide, which could further promote adoption of AI into clinical practice and enhance the ability of investigators to build upon prior works.

Footnotes

Acknowledgments

Thank you to arXiv for use of its open access interoperability.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Umaseh Sivanesan

References

Pesapane

Codari

Sardanelli

. Artificial intelligence in medical imaging: Threat or opportunity? Radiologists again at the forefront of innovation in medicine. Eur Radiol Exp. 2018;2(1):35.

Bluemke

Moy

Bredella

, et al. Assessing radiology research on artificial intelligence: A brief guide for authors, reviewers, and readers-from the radiology editorial board. Radiology. 2020;294(3):487-489.

Kelly

Judge

Bollard

, et al. Radiology artificial intelligence: A systematic review and evaluation of methods (RAISE). Eur Radiol. Epub ahead of print 2022. doi:10.1007/s00330-022-08784-6

Prager

Bowdridge

Kareemi

Wright

McGrath

McInnes

MDF

. Adherence to the standards for reporting of diagnostic accuracy (STARD) 2015 guidelines in acute point-of-care ultrasound research. JAMA Netw Open. 2020;3(5):e203871.

van der Pol

McInnes

Petrcich

Tunis

Hanna

. Is quality and completeness of reporting of systematic reviews and meta-analyses published in high impact radiology journals associated with citation rates? PLoS One. 2015;10(3):e0119892.

Mongan

Moy

CharlesKahn

. Checklist for artificial intelligence in medical imaging (CLAIM): A guide for authors and reviewers. Radiology: Artif Intell. 2020;2(2):e200029.

Bossuyt

Reitsma

Bruns

, et al. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. Radiology. 2015;277(3):826-832.

McInnes

MDF

Moher

Thombs

, et al. Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: The PRISMA-DTA statement. JAMA. 2018;319(4):388-396.

Altmetric . The Donut and Altmetric Attention Score: An At-A-Glance Indicator of the Volume and Type of Attention a Research Output Has Received 2022. Accessed May 9, 2022. https://www.altmetric.com/

10.

Team RC . A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, v4.1.3; 2022. Accessed April 11, 2022. https://www.R-project.org/

11.

O’Shea

Sharkey

Cook

GJR

Goh

. Systematic review of research design and reporting of imaging studies applying convolutional neural networks for radiological cancer diagnosis. Eur Radiol. 2021;31(10):7969-7983.

12.

Razavi

Jakeman

Saltelli

, et al. The future of sensitivity analysis: An essential discipline for systems modeling and policy support. Environ Model Software. 2021;137:104954.

13.

Aviv-Reuven

Rosenfeld

. Publication patterns’ changes due to the COVID-19 pandemic: A longitudinal and short-term scientometric analysis. Scientometrics. 2021;126(8):6761-6784.

14.

Roberts

Driggs

Thorpe

, et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat Mach Intell. 2021;3(3):199-217.

15.

Wynants

Van Calster

Collins

, et al. Prediction models for diagnosis and prognosis of covid-19: Systematic review and critical appraisal. BMJ. 2020;369:m1328.

16.

Van Noorden

Singh Chawla

. Hundreds of extreme self-citing scientists revealed in new database. Nature. 2019;572(7771):578-579.

17.

Ioannidis

JPA

Boyack

Baas

. Updated science-wide author databases of standardized citation indicators. PLoS Biol. 2020;18(10):e3000918.