Sage Journals: Discover world-class research

Abstract

Background

Artificial intelligence (AI) remains a focal subject in medical imaging, with prospective applications that span the entire imaging lifecycle. However, a key hurdle to the development and real-world application of AI algorithms is the need for large amounts of well-organized and carefully planned training data, including professional annotations (labelling). Modern supervised AI techniques require thorough data curation to train, validate, and test models efficiently.

Objective

The proper processing of medical images for use by AI-driven solutions is a critical component in the development of dependable and resilient AI algorithms. Currently, research organizations and corporate entities frequently confront data access limits, working with small amounts of data from restricted geographic locations.

Methods

This study provides an in-depth examination of the publicly accessible datasets in the field of medical imaging. It also determines the methods required for preparing medical imaging data for the development of AI algorithms and emphasizes current limitations in dataset curation. Furthermore, this study explores inventive strategies to address the challenge of data availability, offering a detailed overview of data curation technologies.

Results

This study provides a comprehensive evaluation of medical imaging datasets and emphasizes their vital significance in improving diagnostic accuracy and AI models, while also addressing key problems such as dataset diversity, labelling, and ethical implications.

Conclusions

The paper concludes with an insightful discussion and analysis of challenges in medical image analysis, along with potential future directions in the field.

Keywords

Alzheimer's disease; dataset curation tools; dataset preparation; medical image analysis; medical imaging dataset

Introduction

Since the invention of medical imaging technology, the field of medicine has entered a new era. Medical imaging began with the adoption of x-rays. Artificial intelligence (AI) has emerged as an important topic in health care during the last two decades,¹ especially for medical imaging.^2,3 Medical imaging is critical for providing information that supports the diagnosis and treatment of a wide range of diseases. The various imaging modalities capture scans of the human body by exploiting its diverse responses. Reflection and transmission are prevalent techniques in medical imaging, leveraging the distinct reflection or transmission ratios of various body tissues and substances. Many researchers have turned their attention to AI in medical imaging, believing that it might help solve problems such as medical resource shortages and make use of technological advancements.^4–6

A frequent challenge in applying deep learning (DL) methods within a particular domain is insufficient data, and this scarcity tends to be more pronounced in medical image analysis. Researchers who apply DL methods to medical image analysis are often computer scientists without a medical background. Lacking access to medical equipment and patients, they are unable to gather data independently. Additionally, their lack of relevant medical knowledge prevents them from annotating the data they obtain. Moreover, medical data are owned by institutions, which face challenges in making them publicly available due to privacy and ethical constraints. Consequently, when researchers assess their algorithms using private data, the comparability of their research results is compromised.⁷

Nowadays, numerous research groups and organizations do not have their own access to medical images, and the small number of available datasets impedes the generalizability and accuracy of the methods developed. The objectives of these challenges include the enhancement and creation of automatic or semi-automatic algorithms, thereby encouraging research in medical image processing through the utilization of computer-aided methodologies.^7,8 Meanwhile, several scholars and organizations have initiated efforts to collect and share medical datasets for research purposes. Furthermore, successful AI algorithms require effective curation, analysis, labelling, and clinical application.

In the present study, research publications were retrieved from different academic databases using several search keywords, including ‘medical imaging datasets’, ‘dataset curation tools’, ‘medical imaging analysis’, and ‘DL/ML techniques for medical imaging’. After accumulating articles from the various databases, we used precise selection criteria to decide which ones to include or omit, as shown in Table 1.

Table 1.

Articles selection criteria.

Included studies	Excluded studies
Medical imaging datasets	Articles published in non-indexed journals
Medical imaging curation tools	Short length articles
Articles published between 2011–2024	Language other than English
Articles published in WoS and Scopus-indexed journals	Articles without relevant materials

As Table 1 shows, short articles and articles without relevant material were disregarded. Publications for additional evaluation were chosen based on their quality assessment scores.

This study aims to tackle these issues by offering a detailed review of available medical datasets and outlining several important steps for preparing a large volume of medical image data. It also explores new methods to solve the problem of limited data availability, giving a comprehensive look at how data can be effectively curated. Finally, the paper wraps up with a discussion on the challenges of processing medical images and the future possibilities in this area. Table 2 shows the overview of current review studies on medical imaging.

Table 2.

Overview of recent review publications on medical imaging.

Study and year of publication	Medical image datasets	Data preparation	Data curation tools	Limitation and future work
Simpson et al. (2019)⁹	✔	✘	✘	✘
Tajbakhsh et al. (2020)¹⁰	✘	✔	✘	✔
Willemink et al. (2020)¹¹	✔	✔	✘	✔
de Araujo et al. (2022)¹²	✘	✔	✘	✔
Denner et al. (2023)¹³	✘	✔	✔	✘
Li et al. (2023)⁶	✔	✘	✘	✔
Our Work	✔	✔	✔	✔

This table highlights the extent of discussion on various aspects of medical imaging research. Notably, data curation methods, which are critical for utilizing these datasets effectively, are often underrepresented in the literature.

Table 2 summarizes key review publications that have addressed medical imaging datasets, data preparation techniques, and future directions. This work makes the following specific contributions: (1) Table 3 shows medical imaging datasets with details on modality types, bodily organs, purposes, and file formats; (2) Data preparation as shown in Figures 1 and 2 and Tables 4 and 5; (3) Data curation tools as shown in Table 6; and (4) Discussion and future works.

Figure 1.

Medical image preparation process.

Figure 2.

List of HIPAA identifiers (https://www.healthcarecompliancepros.com/blog/phi-and-you-the-basics-you-need-to-know).

Table 3.

Medical image datasets

Database & online availability	Modality types	Body organs	Aims & file format
COINS 2011¹⁴	MRI, MEG, EEG	Brain	Offer researchers an information system structured on an open-source model, incorporating web-based tools for managing studies, subjects, imaging, clinical data, and additional assessments. The system supports the DICOM format.
IDA 1990¹⁵	MRI, fMRI, PET etc.	Brain	Comprises medical images from various modalities acquired over different time periods, available in both DICOM and GIF file formats.
MIAS 1994¹⁶	MG, US	Breast	Types of tissue (background), class of abnormality present, severity of abnormality, etc., for cancer screening. In matrix form: 1024 × 1024
TCIA 2015¹⁷	CT, PET, X-rays, SPECT, MRI, Colon etc.	Breast, Chest, Brain, neck, lungs, heart, kidney	Encompasses 22 modalities covering 54 distinct organs and medical images from clinical trials, showcasing both normal and various diseased states of body organs, all stored in the DICOM file format.
OASIS 2010¹⁸	PET, MRI	Brain	Encompasses a total of 3059 subjects with MR sessions, 471 PET sessions, and 1472 CT sessions. The images cover a spectrum of Alzheimer's severity, including normal, mild, moderate, and severe cases, all stored in DICOM format but convertible to NIFTI or NRRD format.^19,20
ADNI 2003²¹	PET, MRI, fMRI	Brain	Contains both MRI and PET modalities; supports CSV file format.
Allen Brain Atlas 1995²²	CT, PET, SPECT, MRI	Brain	2D and 3D brain images depicting normal, neoplastic disease, cerebrovascular issues, infectious diseases, and degenerative conditions, all stored in GIF file format.
MIDAS 2010²³	PET, MRI, CT, US, SPECT etc.	Liver, Heart, Brain, Head, Bones	Encompasses medical images from various modalities acquired at different time intervals, available in both DICOM and GIF file formats.
BCDR 2012²⁴	X-ray (BCDR-FM & BCDR-DM)	Breast	Detected breast anomalies, breast density, and BIRADS classification. DICOM file format
NLM Visible Human Project 1995²⁵	CT, MRI	Multiple organs	MRI-Male head, MRI-Male Thorax, MRI-Male Abdomen, MRI-Male Pelvis, MRI-Male Thigh, MRI-Male Feet, etc. PNG format
TCGA 2005²⁶	Genomic data for almost all parts of the body	Full body	Genome sequencing and functional computations. BAM data format
DDSM 1999²⁷	X-rays	Breast	Consists of 2630 normal and primary cancer images with 42–50 microns of resolution; supports PGN file format
UCSB 2005¹⁹	Genomic data	Full body	Provides globally gathered values of genome-wide expression for all pixels at a given stage and orientation; supports CSV format
Grand Challenge Archive 2021²⁰	MRI and Microarray data	Brain	Dynamic neural impulses across the brain execute basic processing and produce flexible behaviors; supports XML and CSV file formats
NBIA 2023²⁸	Radiology	Liver, Heart, Brain, Head, Bones	Facilitates users in securely storing, searching, and downloading diagnostic medical images, offering a searchable repository that integrates in vivo cancer images with clinical data, all in DICOM format.
UK BIOBANK 2007²⁹	MRI	Brain, heart and full body	Cancer, heart diseases, stroke, diabetes, arthritis, osteoporosis, eye disorders, depression, and forms of dementia. PMG format
Kaggle 2010³⁰	Multiple	Multiple organs	Multiple activities. CSV and other formats
Medical Segmentation Decathlon 2022³¹	CT, MRI	Multiple organs	Generalizable 3D semantic segmentation. Tar format
OpenNeuro 2011³²	MRI, CT, fMRI, dMRI, SPECT	Brain	Free and open platform for validating and sharing BIDS-compliant data. BIDS format
NITRC 2006³³	Multiple	Brain	Provides a containerized data analysis environment to facilitate reproducible analysis of neuroimaging data. DICOM and NIFTI formats
FITBIR 2011³⁴	MRI, PET	Brain	Traumatic brain injury (TBI). All formats
CQ500 2020³⁵	TBI, Stroke	Brain	Detection of critical findings in head CT scans. DICOM and JPEG
NDA 2018³⁶	MRI	Brain	Platform that facilitates data sharing across all mental health and other research groups. DICOM format
CONNECTTOME 2001³⁷	sMRI, fMRI	Brain	Houses and distributes public research data focused on the connections within the human brain. Multiple formats
OMI-DB 2008³⁸	MG	Breast	Optimize the adoption of new X-ray technology for detecting breast cancers and thereby improve the early detection of breast cancers. DICOM header and expert annotations
Cardiac Atlas Project 2023³⁹	MRI	Cardiac	A global consortium and online platform for the integration and dissemination of cardiac imaging examinations, including parametric model-derived functional analyses and relevant clinical data, all in DICOM format.
Cornell Engineering: Vision and Image Analysis Lab 2019⁴⁰	CT	Lungs	Comprises lung CT images annotated by radiologists to aid researchers, all in DICOM format.
BIMCV-COVID-19 2022⁴¹	CR, DX, CT	Chest	Normal and affected images. DICOM and MIDS formats
CHASE_DB1 2012⁴²	Fundus Images	Retina	Vessel analysis and abnormality detection. PNG and JPG formats

Table 4.

Some important medical image open-source solutions.

Name	API	User guide	Real world application	Source
Extensible Neuroimaging Archive Toolkit (XNAT)	Java Server Faces (JSF)	XNAT website	researchers, clinicians, and institutions	https://www.xnat.org/about/
Dicoogle	RESTful API	Official Dicoogle website; Dicoogle GitHub repository	Researchers, clinicians	https://www.dicoogle.com/
Kheops	RESTful API endpoints or SDKs	official Kheops documentation or website	Clinical	https://kheops.online/
OHIF	JavaScript	OHIF GitHub repository and the OHIF documentation website	Clinical	https://ohif.org/
PacsOne	MySQL, Java	Provided by the vendor or developer of the software	Research	https://www.pacsone.net/download.htm
Dcm4Che	RESTful, Java	Official Dcm4Che website	Clinical	https://www.dcm4che.org/

Table 5.

Tools for medical image labelling.

Name	Open access	Web-based	Visualization in 3D/4D	Annotations	Source
Horos	Yes	No	Yes	allows users to add annotations and measurements to the medical images	www.horosproject.org
ePAD	Yes	Yes	Yes	Manual but directly in the platform	epad.stanford.edu
Seg3D	Yes	No	Yes	Manual	www.sci.utah.edu/cibc-software/seg3d
ParaView	Yes	No	Yes	Manual	www.paraview.org
MITK	Yes	No	Yes	Manual using interaction with interface directly	www.mitk.org
MeVisLab	Yes	No	Yes	Manual/Auto (Both)	www.mevislab.de
ImageJ	Yes	No (ImageJ can be used through a cloud environment)	Yes	Manual/Auto (Both)	fiji.sc

Table 6.

Tools for medical image curation.

Name	Basic application	Documentation	Source
XNAT (Extensible Neuroimaging Archive Toolkit)	Data management, sharing and tracking data, image segmentation and registration etc.	Open source	https://www.xnat.org/
Posda Tools	Handling DICOM data, Annotation and Metadata, Security, Integration etc.	Open source software platform	https://posda.com/
OsiriX	displays all types of DICOM files, 3D Rendering, 2D Endoscopy, multiplanar reconstruction, PACS integration	Website, open source	https://www.osirix-viewer.com/
MITK Workbench	view, process, and segment medical images, Visualization, GUI, and integration with ITK, VTK features	Open source, user manual available on website	https://www.mitk.org/wiki/The_Medical_Imaging_Interaction_Toolkit_(MITK)
BrainSuite	MRI processing, cortical surface modelling, diffusion, registration, etc.	Open source, available for download	https://brainsuite.org/
ImageJ	Data organization, annotation, quality control, integration, and providing web-based solutions for various image curation tasks	Open source, can download without any licence	https://imagej.net/ij/
Horos	Image viewing, annotation and measuring, 3D visualization, DICOM networking and so on.	Open source medical image viewer and DICOM image viewer	https://horosproject.org/
AMIDE (A Medical Imaging Data Examiner)	DICOM image viewing, image analysis, image fusion, 3D visualization, cross-platform compatibility, etc.	Open source; provides a user-friendly interface to clinicians, researchers, etc.	https://amide.sourceforge.net/
OHIF (Open Health Imaging Foundation) Viewer	Web-based interface, DICOM viewer, navigating, zooming, panning, and adjusting image settings, security, privacy, annotation, etc.	Open source web-based medical image viewer	https://ohif.org/
BIDS Tools	DICOM image conversion to BIDS	Open source, website available	https://github.com/bidsstandard
DCMTK	C/C++ Library, DICOM converter, compare, validate and network DICOM images	Open source, wiki code, website	https://dcmtk.org/en/
Mango (Multi-image Analysis GUI)	Multi-image analysis, DICOM viewing, 3D visualization, annotation and measurement, customization, etc.	Open source; provides a user-friendly interface to clinicians, researchers, etc.	https://nmmitools.org/2020/06/15/mango/

The rest of the article is structured as follows.

Section 2 provides an extensive overview of the medical imaging databases. Section 3 describes the steps for medical imaging data preparation. Section 4 introduces the data curation tools, while Section 5 examines existing issues and future directions. Section 6 concludes the review.

Medical image datasets

Obtaining imaging data is an essential element of developing AI algorithms for imaging diagnostics. These datasets are used for training and testing AI algorithms. Because of the importance of patient privacy, many market-oriented AI models rely on private or hospital datasets that are not publicly available. The main aim of this study is to explore in depth the best accessible medical imaging datasets, along with their modality types, body organs, medical image classification, and file formats. We believe that this succinct overview will help scholars in an efficient and straightforward manner. Most of the datasets are open access; however, a few of them require registration to view the data. Table 3 displays the details of the state-of-the-art medical imaging datasets.

There are many challenges to working with medical imaging datasets. Many of them have been explored at MICCAI (a well-known conference, https://miccai.org/), and researchers and organizations have attempted to address these difficulties by providing specialized datasets. Starting in 2018, MICCAI additionally developed an online platform for sharing and addressing these challenges. Table 3 presents many well-known databases. For example, the TCIA data repository provides curated imaging sets for several organs, focusing on cancer imaging.¹⁷ The database has expanded to include x-ray and computed tomography (CT) images for COVID-19 patients. Images from this site can be downloaded in collections based on a common condition or imaging modality. Similarly, the UK Biobank is another important resource in the field of medical data collection and research.²⁹ In addition to a wide range of clinical data, such as electronic health records (EHR), it has imaging collections from over 100,000 individuals, including scans of the abdomen, brain, heart, carotid artery, and bones.

The imaging repositories mentioned earlier contain datasets spanning multiple organs and diverse medical conditions. Nonetheless, there are additional initiatives for data collection that concentrate on specific organs. For instance, substantial collections of neuroimaging datasets are available from repositories such as IDA, OASIS, NITRC, CQ500, and so on. These repositories encompass imaging data for healthy individuals across different age groups, as well as patient data pertaining to various neurological disorders. Medical imaging databases are essential for clinical AI applications. The accuracy, reliability, and impartiality of diagnostic methods are determined by the datasets used to train and evaluate the models.

Data preparation

Data preparation is the most critical stage before using medical images for developing AI techniques.¹¹ One of the primary objectives of this work is to offer a comprehensive overview of medical image data preparation, which can be employed before and during the development, execution, and validation of AI algorithms. Next, we highlight the necessary steps when working with medical images.

Ethical process

Normally, obtaining approval from a local ethical committee is a prerequisite before utilizing medical data for the development of AI techniques.¹¹ A review committee is tasked with evaluating the risks and benefits of the study for patients.¹¹ In clinical studies, individual principal researchers may be required to grant permission for the disclosure of data concerning their patients.¹¹ After completing all ethical formalities, the relevant data should be made accessible, systematically searched, accurately de-identified, and securely stored. Any confidential patient data must be removed from both the DICOM metadata and the image files.¹¹ Figure 1 summarizes the process of preparing medical images for AI development.

Accessing data

AI algorithm developers often lack direct access to medical imaging data through PACS (Picture Archiving and Communication System), particularly for commercial purposes. PACS is a comprehensive solution for the storage, management, and retrieval of medical images. It transformed old film-based procedures by digitizing radiological images such as MRI, CT, PET, and X-rays, enabling computer storage. Enabling data access for AI developers is a challenging task that involves multiple processes, one of which is data de-identification (discussed below). Once data is available to researchers, there are several ways to search for medical images and clinical information.

Querying data

After data becomes accessible to AI developers, many methods exist for searching medical images and clinical information.¹¹ Researchers may use customized search commands to access medical data. Custom search commands may include text strings, global disease classification codes, and modern medical terminology codes. PACS or radiology information system search engines can be used to conduct a systematic search and retrieval of data from hospital PACS and digital health records. Data must be regularly examined and extracted from both PACS and digital health records. Many PACS providers, for example, give users access to metadata such as the annotations, the source, sequence, and image numbers, as well as the unique target injury name and relationship. In some PACS, researchers can access this data and further organize and manage it with other systems such as digital cancer repositories, medical repositories, and other databases.⁴³ Alternatively, software tools exist to facilitate the process of data querying.^43,44
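The kind of customized search described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the interface of any real PACS or radiology information system: the `StudyRecord` fields, the `query_studies` function, and the sample records are all hypothetical stand-ins for an exported study index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyRecord:
    """Hypothetical index entry, as might be exported from a PACS or RIS."""
    accession: str
    modality: str
    body_part: str
    diagnosis_codes: tuple  # disease classification codes (illustrative)
    report_text: str


def query_studies(records, modality=None, code_prefix=None, text=None):
    """Return the records that match every criterion that is not None."""
    hits = []
    for rec in records:
        if modality is not None and rec.modality != modality:
            continue
        if code_prefix is not None and not any(
            code.startswith(code_prefix) for code in rec.diagnosis_codes
        ):
            continue
        if text is not None and text.lower() not in rec.report_text.lower():
            continue
        hits.append(rec)
    return hits


# Made-up records used only to exercise the query function.
RECORDS = [
    StudyRecord("A001", "CT", "CHEST", ("C34.1",), "Mass in the right upper lobe."),
    StudyRecord("A002", "MR", "BRAIN", ("G30.9",), "Hippocampal atrophy."),
    StudyRecord("A003", "CT", "CHEST", ("J18.9",), "Consolidation, likely pneumonia."),
]

ct_lung_hits = query_studies(RECORDS, modality="CT", code_prefix="C34")
print([rec.accession for rec in ct_lung_hits])
```

Combining a modality filter, a code prefix, and a free-text match mirrors the mix of strings and classification codes mentioned above; a production search would run against the PACS or health-record system itself.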

Image de-identification

Image de-identification is the process of removing or anonymizing personal information, such as name, address, and medical record number, from medical images to protect the privacy of those concerned.⁶ De-identification is crucial when sharing or using medical images for research, education, or other purposes outside of direct patient care. The process typically involves steps such as removing the patient name, date of birth, and medical record number from the metadata, anonymizing dates, removing overlays, and so on.^6,11 These identifying data are normally stored in the DICOM metadata, and many tools are available to remove them. The goal of image de-identification is to preserve the utility of medical images for research and educational purposes while safeguarding patient privacy and complying with ethical and legal standards, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States or similar regulations in other regions.¹¹ Figure 2 shows the list of 18 HIPAA identifiers.
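The metadata part of this process can be illustrated with a short sketch. It assumes, for simplicity, that the header has already been read into a plain Python dictionary; the key names, the `deidentify` function, and the choice of which attributes count as identifying are illustrative assumptions and cover only a small subset of the HIPAA identifiers. Burned-in overlays in the pixel data are not handled here.

```python
from datetime import date, timedelta

# Assumption: a small illustrative subset of identifying attributes,
# not the full list of HIPAA identifiers.
IDENTIFYING_KEYS = {
    "PatientName",
    "PatientBirthDate",
    "PatientAddress",
    "MedicalRecordNumber",
}
DATE_KEYS = {"StudyDate"}


def deidentify(metadata, pseudo_id, shift_days):
    """Return a copy with identifiers dropped, dates shifted, and a pseudonym set."""
    clean = {}
    for key, value in metadata.items():
        if key in IDENTIFYING_KEYS:
            continue  # drop direct identifiers entirely
        if key in DATE_KEYS:
            # A constant per-patient shift keeps intervals between studies intact.
            value = value + timedelta(days=shift_days)
        clean[key] = value
    clean["PatientID"] = pseudo_id  # replace any original ID with the pseudonym
    return clean


# Made-up header used only for demonstration.
original = {
    "PatientName": "DOE^JANE",
    "PatientID": "MRN-12345",
    "PatientBirthDate": date(1960, 5, 17),
    "StudyDate": date(2021, 3, 2),
    "Modality": "MR",
}
safe = deidentify(original, pseudo_id="SUBJ-0001", shift_days=-30)
print(safe)
```

Returning a new dictionary rather than editing in place keeps the original record untouched, which makes it easier to audit what was removed.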

Data storage

Medical image storage is the secure and effective archiving of medical images created by x-rays, CT scans, MRIs, ultrasounds, and other imaging modalities. Proper storage is vital for maintaining, retrieving, and preserving these valuable diagnostic and patient care assets. Here are the main characteristics of medical image storage:

Picture Archiving and Communication System (PACS): PACS is a comprehensive technology that facilitates the storage, retrieval, and distribution of medical images in a digital format. It typically integrates with medical imaging equipment and electronic health record (EHR) systems.

Storage infrastructure: Medical images tend to reside on dedicated, high-performance storage technology. They might be stored on a local server or in the cloud, depending on medical institutions’ choices and needs.

Scalability: Data storage devices must be able to store large volumes of data because of the continuous growth of medical data. Scalability ensures that medical professionals can increase their storage capacity to meet the changing demands of the organization.

Security and compliance: Security is essential since medical images and patient data are extremely confidential. Storage devices must follow health privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States or similar requirements in other regions.

Redundancy and backup: To prevent data loss, medical image storage devices often employ redundancy and backup methods. Redundancy assures that data is replicated across different storage devices or locations, while regular backups allow recovery of data in the case of device failures or other issues.

Integration with electronic health records (EHR): Integration with EHR systems allows healthcare professionals to access medical images alongside patient health records, providing a comprehensive view of a patient's medical history. An EHR is an electronic form of the paper chart containing patient history, test results, treatment, medications, and so on. Medical doctors can access the EHR from different locations. Using the EHR, doctors can also exchange patient information with different hospitals.

Aside from commercial alternatives such as PACS and Vendor Neutral Archives (VNA), there are various open-source solutions that require little commitment from academics and physicians to implement.^45,46 All these open-source alternatives offer platforms for creating your own server, but they do not provide free storage, which would be expensive to operate. They are, however, extensible and include several plugins, allowing you to save medical images in the cloud with a provider that complies with data protection rules. Table 4 lists open-source medical image storage solutions.

In summary, medical image storage involves the careful management of digital images, ensuring their security, accessibility, and compliance with regulatory standards, to support effective patient care and medical research.
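The redundancy and backup characteristic listed above relies on replicas being bit-identical to the source. A minimal sketch of that check, using only the Python standard library, is given below; the function names and the temporary file are hypothetical, and a real archive would apply the same idea across storage devices or sites.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_and_verify(source, replica_dir):
    """Copy source into replica_dir and confirm the copy is bit-identical."""
    replica_dir = Path(replica_dir)
    replica_dir.mkdir(parents=True, exist_ok=True)
    target = replica_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"replica of {source} does not match the original")
    return target


# Demonstration on a throwaway file standing in for an image file.
with tempfile.TemporaryDirectory() as tmp:
    image_file = Path(tmp) / "scan_0001.dcm"
    image_file.write_bytes(b"placeholder image bytes")
    replica = replicate_and_verify(image_file, Path(tmp) / "backup")
    print(replica.name, sha256_of(replica)[:12])
```

Storing the digest alongside each file also allows later integrity audits without re-reading the original.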

Quality control

Quality control in medical image processing is an important step in ensuring that the images obtained are of good quality, free of artifacts, and suitable for accurate diagnosis and analysis.^47,48 It is a systematic process of checking and evaluating images to identify any issues that may impact their quality, covering the overall accuracy, precision, and durability of images acquired throughout the acquisition stage. It includes a variety of parameters that influence the quality and usefulness of the collected images for their final usage.^49,50 The main parameters for improving image quality are noise reduction, color accuracy, sharpness, and so on.

Artifact identification and removal is the second step in quality control. It is the process of detecting and removing undesired features in images, known as artifacts. Artifacts are distortions, abnormalities, or inconsistencies that occur during image acquisition, processing, or transmission.^51,52 They may degrade image quality, hide essential details, and hinder proper interpretation and analysis. Similarly, image resolution plays an important role in different fields, including remote sensing, medical imaging, photography, and printing. It is the level of detail present in an image and is usually stated as the total number of pixels in the entire image.⁵¹ Resolution is a basic feature of digital images that is affected by the size of the image, pixel dimensions, and pixel intensity. The method proposed in ref.⁵³ applies an anti-illumination approach to single-image super-resolution (SISR), named Light-guided and Cross-fusion U-Net (LCUN), which can simultaneously improve the texture details and lighting of low-resolution images.

During quality control, non-uniformity among different modalities such as PET, MRI, CT, and x-ray must also be removed. Inconsistency across modalities might result in data being misunderstood or misinterpreted, which could jeopardize the treatment of patients.⁵¹ Moreover, the quality of visual display is crucial in several fields, including medical imaging, graphic design, digital painting, and multimedia production.^50,51 Image display quality includes several elements that influence image perception and interpretation, such as resolution, color correctness, contrast, and brightness similarity. Finally, radiation dose monitoring is also required during the quality control process. It is the systematic measurement and analysis of the radiation dose absorbed by persons during procedures that use ionizing radiation. It is vital for keeping imaging, therapy, and other radiation-based procedures safe and effective.
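Two of the checks discussed above, resolution and noise, lend themselves to a simple automated screen. The sketch below is only an illustration: the image is a plain list of pixel rows, the signal-to-noise ratio is estimated as mean signal divided by the standard deviation of a background region, and the thresholds are hypothetical values that would need to be set per modality and protocol.

```python
from statistics import mean, pstdev


def snr_estimate(pixels, background):
    """Mean signal divided by the standard deviation of a background region."""
    noise = pstdev(background)
    if noise == 0:
        return float("inf")
    return mean(pixels) / noise


def quality_report(image, background, min_rows=4, min_cols=4, min_snr=5.0):
    """Flag images that are too small or too noisy (thresholds are hypothetical)."""
    rows, cols = len(image), len(image[0])
    flat = [value for row in image for value in row]
    snr = snr_estimate(flat, background)
    return {
        "resolution_ok": rows >= min_rows and cols >= min_cols,
        "snr": snr,
        "snr_ok": snr >= min_snr,
    }


# Synthetic data: a uniform 4 x 4 image and a small background sample.
image = [[100] * 4 for _ in range(4)]
background = [0, 2, 0, 2]
report = quality_report(image, background)
print(report)
```

Images that fail either flag can be routed to manual review rather than discarded automatically.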

Physicians, technicians, and other medical professionals need to work together to ensure top-notch medical image preparation. Regular inspections and ongoing training promote an environment of quality and constant advancement in the use of medical imaging. Furthermore, compliance with laws and regulations is required to achieve quality assurance standards.^6,11

Structure data

Structuring data is the process of arranging and classifying medical information in a uniform way that enables retrieval, storage, analysis, and exchange. It is essential for ensuring stability, interoperability, and effectiveness across multiple healthcare networks.⁶ Organizing medical imaging datasets is essential for efficient storage, retrieval, and analysis in health care organizations. The DICOM format is widely used for storing and exchanging data associated with medical imaging. DICOM images are stored in files that follow the DICOM standard. These files may have extensions such as .dcm or .dicom and include metadata such as patient history, imaging type, acquisition settings, and so on. This metadata is essential for properly ordering and analyzing the images.^11,13

In addition, medical images must be organized systematically at the patient, study, series, and instance levels. Likewise, a uniform file naming approach should be used to make image files easy to find and handle. PACS systems have been developed to efficiently organize and make available DICOM images.^6,46,54 Medical personnel can successfully handle medical images by following prescribed guidelines, arranging data hierarchically, and incorporating metadata. This structured strategy optimizes medical procedures, research activities, and cross-institutional collaboration.⁴⁶
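A hierarchical layout with uniform file names can be generated mechanically from the metadata. The sketch below shows one possible convention; the directory prefixes, zero-padding widths, and the `instance_path` function are assumptions made for illustration and are not prescribed by DICOM or by any tool in Table 4.

```python
from pathlib import PurePosixPath


def instance_path(root, patient_id, study_uid, series_number, instance_number):
    """Build a patient/study/series/image path with zero-padded, sortable names."""
    return (
        PurePosixPath(root)
        / f"sub-{patient_id}"
        / f"study-{study_uid}"
        / f"series-{series_number:03d}"
        / f"instance-{instance_number:05d}.dcm"
    )


# Illustrative identifiers only.
path = instance_path("archive", "0001", "1.2.3", series_number=2, instance_number=17)
print(path)
```

Zero-padding the series and image numbers keeps files in acquisition order when listed alphabetically, which simplifies both manual browsing and scripted loading.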

Labelling data

Labelling medical images means annotating or classifying regions of interest in the image to provide ground truth information for AI algorithm training/testing or research purposes.^6,46 Image annotation is one of the basic problems in image labelling.

Annotation covers regions of interest, landmarks, and label classification. Selecting appropriate annotation tools for each annotation class is an additional vital component of data labelling. These tools might vary from simple sketching tools for identifying regions to more advanced tools for 3D volumetric annotations.⁶

Likewise, uniformity in labelling and multidimensional annotations are challenging aspects of data labelling. To produce reliable and consistent information, annotators must employ a uniform labelling mechanism. Define annotation criteria or utilize standardized taxonomies when appropriate, and experiment with annotating medical images from various imaging types, such as MRI, x-ray, or CT scans, to create large datasets for algorithm training/testing or research studies. Another challenging area of data labelling is incorporating metadata annotations such as patient history, imaging parameters, and acquisition information into the labelled dataset. This information enhances the contextualization of the images. Furthermore, documentation of the labelling process, including guidelines, definitions, and any obstacles encountered, is essential throughout the data labelling process. This documentation is useful for future use and sharing the dataset with others.
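One common way to quantify whether annotators are applying a labelling scheme uniformly is an agreement statistic. The sketch below computes Cohen's kappa for two annotators who labelled the same images; it is offered as an illustration of a consistency check, not as a method used in the reviewed studies, and the label lists are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations for four images.
annotator_1 = ["tumor", "normal", "tumor", "normal"]
annotator_2 = ["tumor", "normal", "normal", "normal"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A value near 1 indicates consistent labelling, while a value near 0 suggests agreement no better than chance and a need to revisit the annotation criteria.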

Labelling medical image data is an important step in testing and validating AI algorithms for analyzing medical images. Whether for training deep learning models or supporting clinical workflows, accurate and standardized annotations contribute to the reliability and effectiveness of medical image datasets. Comprehensive lists of data annotation tools have been provided in earlier studies.^55,56 In addition, Table 5 lists a variety of open-source platforms and applications focused on medical images.

Data curation tools for medical imaging

Medical image data curation tools are advanced software applications or platforms designed to assist in the organization, management, integrity, annotation, verification, extraction, and quality control of medical image datasets. These tools play a crucial role in preparing medical imaging data for research, training, and clinical applications. Without proper data curation techniques, AI algorithms may exhibit low efficiency and accuracy, resulting in unsatisfactory outcomes and, in some situations, failure. To present a comprehensive and informative collection of curation tools, we focus on those with general uses that address common use cases in medical imaging. For example, Table 6 illustrates widely used curation tools. Data curation tools are essential for reviewing, detecting, avoiding, and addressing shortcomings in datasets.⁶ As a result, without data curation techniques, possible issues such as errors from untrustworthy data, bias introduction, and doubt about the validity of prediction outcomes may arise later in the AI development process.

Table 5 highlights some prominent data curation tools available to researchers, clinicians, and healthcare professionals. Choosing the right data is critical for training and evaluating AI algorithms.

An effective data curation tool must be able to clean, arrange, and locate relevant data for particular tasks. This requires handling huge datasets and filtering data based on specific criteria. Likewise, comprehending and interpreting large datasets requires dedicated data visualization.
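The cleaning and filtering steps mentioned above can be illustrated with a small Python sketch. It is a simplified assumption of what such a step might look like, not the behaviour of any specific curation tool; the function name `curate` and the record fields `image_id`, `modality`, and `label` are hypothetical.

```python
def curate(records, required=("image_id", "modality", "label"), modality=None):
    """Drop incomplete and duplicate records, then filter by modality."""
    seen = set()
    kept = []
    for rec in records:
        # Cleaning: skip records with missing or empty required fields.
        if any(not rec.get(key) for key in required):
            continue
        # Cleaning: the first record seen for an image_id wins.
        if rec["image_id"] in seen:
            continue
        seen.add(rec["image_id"])
        # Filtering: keep only the requested modality, if one was given.
        if modality is not None and rec["modality"] != modality:
            continue
        kept.append(rec)
    # Arranging: return records in a stable, predictable order.
    return sorted(kept, key=lambda r: r["image_id"])


records = [
    {"image_id": "b", "modality": "CT", "label": "lesion"},
    {"image_id": "a", "modality": "CT", "label": "normal"},
    {"image_id": "a", "modality": "CT", "label": "normal"},   # duplicate
    {"image_id": "c", "modality": "MRI", "label": "normal"},
    {"image_id": "d", "modality": "CT", "label": ""},         # missing label
]
print([r["image_id"] for r in curate(records, modality="CT")])
```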

An effective tool should be able to provide data in a variety of formats, such as graphs, tables, and images, and allow the user to modify these visualizations to meet their specific needs. A data curation system should be able to deal with a wide range of data types, including images, DICOM, and BIDS, as well as all labelling standards such as boundaries, grouping, and polyline.

Furthermore, data curation tools should have a simple and user-friendly interface, as they are frequently used by both technical and non-technical stakeholders. Depending on the demands and requirements of a project or study, multiple technologies may be employed to organize and curate medical data for a range of clinical and scientific applications. Some important free medical imaging software packages are discussed below.

Open-source software for medical imaging

The research community has shown a strong interest in open-source software, making substantial contributions to the development of publicly accessible software for a wide range of applications. This includes image processing tasks essential to AI research, such as anonymization, curation, categorization, and labelling of medical images.

For example, ‘ImageJ’ is a multiplatform, Java-based image processing and analysis tool.⁵² It is freely available in the public domain and does not require a license. It supports various file formats through freely accessible plugins. ImageJ offers a wide range of capabilities for manipulating images, such as image filtering, edge detection, sharpening, and morphological processing. It also includes analysis tools for measuring areas, boundaries, and angles of specified regions. In addition, ImageJ can natively manage multidimensional data, such as image stacks obtained from MRI scans.

The Medical Imaging Interaction Toolkit (MITK) is an open-source program developed using the Insight Toolkit (ITK).⁵² The software offers intuitive tools for both manual and computer-assisted image classification. It is compatible with all commonly used file formats in medical imaging and has a multi-window interface that makes it easy to view and interact with the images. The integrated semi-automatic tool employs active contour techniques. In addition, the latest versions of ITK-SNAP, a related application that is also built on ITK, now incorporate registration functionality, which improves the handling of multimodal images. In addition, it provides a decentralized segmentation solution that allows for cloud-based segmentation using algorithms provided by the developer community.

The Open Health Imaging Foundation (OHIF) (https://ohif.org/) Viewer is a web-based, open-source platform for medical imaging.⁵⁷ Its objective is to provide a basic framework for constructing advanced imaging applications. The software is designed to load large radiology studies quickly by retrieving information in advance and streaming the necessary imaging pixel data as needed. OHIF enables users to develop web-based imaging applications without having to rebuild essential viewer functionality for every new application.

Several open-source applications and libraries have been developed specifically for medical image processing. Notable examples include DIPlib (https://diplib.org) and Icy (https://icy.bioimageanalysis.org/). However, for the sake of conciseness, they are not elaborated upon in this paper.

Discussion and future directions

According to various healthcare professionals,^47,58 AI is profoundly changing the field of medicine, with a particularly significant impact on medical imaging. AI has shown tremendous potential to outperform humans in certain operations, such as image segmentation. Furthermore, AI offers vital insights into the medical decision-making process. Without AI, it would have been exceedingly difficult, almost impractical, to optimally combine and extract this information. The advancement of AI owes much to the growing availability of public medical imaging databases. These images serve as critical inputs for AI models that extract the most relevant attributes, which aid in the identification of anatomical structure boundaries and the prediction of disease. Nevertheless, before this stage, it is imperative to prepare medical images adequately so that they can be used to best effect in AI development and assessment.

As presented in this study, during the last decade, various publicly available medical image databases and open-source tools have evolved to encourage established standards for preparing data for clinical imaging. Nonetheless, significant issues remain, requiring ongoing attention and research, as detailed below.

Image de-identification is critical for protecting patient privacy. In recent years, regulations such as the GDPR and HIPAA have been revised, necessitating the regular alignment of image de-identification tools with these changes. As a result, techniques for automating this process (which is now done semi-automatically) are required, as is validation that these de-identification tools effectively satisfy regulatory guidelines. For example, when developing a 3D reconstruction of the head, it is critical to avoid revealing the individual's identity. Certain spatial information, such as facial features, should therefore be erased. Nevertheless, the challenge lies in removing identifiable facial features while leaving essential scientific information unmodified. This poses a dilemma, especially in conditions such as neck cancer and radiation therapy planning, where essential data is currently compromised.
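To make the metadata side of de-identification concrete, the following Python sketch removes a small set of identifying attributes from a header represented as a plain dictionary and replaces the patient identifier with a salted hash. It is an illustration under stated assumptions, not a compliant de-identification procedure: the attribute list is deliberately incomplete, the function name `deidentify` and the salt handling are hypothetical, and the sketch does not address identifying information in the pixel data, such as burned-in text or facial features.

```python
import hashlib

# Illustrative, incomplete list of identifying header attributes. A real
# de-identification profile covers many more attributes than these.
IDENTIFYING_KEYS = {"PatientName", "PatientBirthDate", "PatientAddress",
                    "InstitutionName"}


def deidentify(header, salt):
    """Return a copy of `header` without direct identifiers.

    PatientID is replaced by a salted hash so that images from the same
    patient can still be grouped without exposing the original identifier.
    """
    cleaned = {k: v for k, v in header.items() if k not in IDENTIFYING_KEYS}
    if "PatientID" in header:
        raw = (salt + str(header["PatientID"])).encode("utf-8")
        cleaned["PatientID"] = hashlib.sha256(raw).hexdigest()[:16]
    return cleaned


header = {"PatientName": "DOE^JANE", "PatientID": "12345",
          "PatientBirthDate": "19700101", "Modality": "MR"}
print(sorted(deidentify(header, salt="project-secret")))
```

The salt must be kept secret and stable within a project; otherwise the pseudonyms either become reversible by brute force or stop matching across files.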

Data curation is a crucial phase that ensures the data is well-organized and managed. After defining the data collection procedure objectively, we should focus on improving the data quality. This can be accomplished by creating standards and rules across the entire process of medical image preparation, spanning from the de-identification phase to the data annotation stage, with a special emphasis on data curation. Furthermore, data collection operations are critical for studies on AI in clinical imaging since they enable the formation of standards to assess AI across numerous centres and scanners. There is an urgent need for automated tools and standards to evaluate image quality, especially for quantitative studies. Additionally, the development of methods for automatically identifying and rectifying image artifacts will be required to ensure a consistent level of quality in images used to train AI algorithms. Encouraging such initiatives to promote quality standards and tools is critical for assuring the dependability of image labelling, annotation, and attributes.

Image annotations play a crucial role in ensuring the accurate training of AI algorithms and should be produced meticulously. Nevertheless, achieving precise delineations or annotations poses significant challenges and is exceedingly time-consuming, particularly in 3D imaging modalities. For clinicians, annotating the thousands of images needed to train AI algorithms can be a challenging and impractical task. Both existing and future public and collaborative annotation technologies have tremendous value in capturing the diversity of annotations generated by multiple physicians.
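The diversity of annotations produced by different physicians can be quantified with an overlap measure. The sketch below uses the Dice coefficient, a commonly used choice for comparing segmentation masks; the study itself does not prescribe a particular measure, so this selection, along with the function name `dice` and the toy masks, is an assumption made for illustration.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as nested lists."""
    # Both masks must have identical row lengths to be compared pixel-wise.
    if [len(row) for row in mask_a] != [len(row) for row in mask_b]:
        raise ValueError("masks must have the same shape")
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    intersection = sum(a and b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    # Two empty masks agree completely by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


reader_1 = [[0, 1, 1],
            [0, 1, 0]]
reader_2 = [[0, 1, 0],
            [0, 1, 1]]
print(round(dice(reader_1, reader_2), 3))
```

A value of 1 indicates identical delineations and 0 indicates no overlap; low values between readers may signal ambiguous annotation guidelines or a genuinely difficult case.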

Aside from data preparation, the major focus of this study, it is important to anticipate possible future trends of AI for clinical imaging, which include: (a) data augmentation, (b) ethical considerations regarding AI, (c) federated learning, and (d) uncertainty estimation.

Data augmentation has emerged as a promising approach within AI to enhance the data preparation phase. Cutting-edge data augmentation methods span from fundamental strategies employing practical geometric transformations, color adjustments, cropping, flipping, and noise injection,⁴⁸ to more sophisticated techniques.⁵⁹ Federated learning is a cutting-edge technique that has been promoted in medical research to protect patient privacy while simultaneously improving the imaging datasets utilized by AI algorithms.⁶⁰ In the conventional approach, de-identified data are moved from the hospital (or silo) to a central storage system. However, with federated learning, the data remain within the hospital while the algorithm can be trained locally at multiple locations. Figure 3 shows the layout of existing versus federated learning mechanisms.
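The fundamental augmentation strategies listed above (flipping, cropping, and noise injection) can be sketched in a few lines of Python. The example operates on a small grayscale image stored as nested lists with intensities assumed to lie in [0, 1]; the function names and the toy image are illustrative, and practical pipelines would normally rely on an array or imaging library instead.

```python
import random


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows."""
    return [row[:] for row in image[::-1]]


def crop(image, top, left, height, width):
    """Return the height x width window whose corner is (top, left)."""
    rows, cols = len(image), len(image[0])
    if (top < 0 or left < 0 or height <= 0 or width <= 0
            or top + height > rows or left + width > cols):
        raise ValueError("crop window must lie inside the image")
    return [row[left:left + width] for row in image[top:top + height]]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


image = [[0.0, 0.5, 1.0],
         [0.25, 0.75, 0.5]]
rng = random.Random(0)  # fixed seed so that the run is reproducible
augmented = [flip_horizontal(image), flip_vertical(image),
             crop(image, 0, 1, 2, 2), add_gaussian_noise(image, 0.05, rng)]
print(len(augmented), "augmented variants")
```

In medical imaging, not every transformation preserves the label: for example, a horizontal flip may be inappropriate where left-right anatomy matters, so the set of augmentations should be chosen per task.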

Figure 3.

Existing versus federated learning. (a) Today's AI model development involves transferring de-identified data to a centralized storage system. (b) Federated learning may be used in the future.
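The contrast between the two approaches can be illustrated with a minimal sketch of federated averaging, in the spirit of the method cited above.⁶⁰ Each site trains on its own data and only model weights are exchanged and averaged. The toy linear model, the learning rate, the number of local epochs, and the two "hospital" datasets are assumptions made purely for illustration.

```python
def local_update(weights, data, lr=0.1, epochs=5):
    """Gradient descent on squared error for y ~ w0 + w1 * x, local data only."""
    w0, w1 = weights
    n = len(data)
    for _ in range(epochs):
        g0 = sum(2 * (w0 + w1 * x - y) for x, y in data) / n
        g1 = sum(2 * (w0 + w1 * x - y) * x for x, y in data) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return [w0, w1]


def federated_average(site_weights, site_sizes):
    """Average site models, weighting each by its number of samples."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(dim)]


def federated_round(global_weights, sites):
    """One round: every site trains locally, then only weights are shared."""
    updates = [local_update(global_weights, data) for data in sites]
    sizes = [len(data) for data in sites]
    return federated_average(updates, sizes)


# Two hypothetical hospitals; the raw (x, y) pairs never leave their site.
hospital_a = [(0.0, 0.0), (1.0, 2.0)]
hospital_b = [(2.0, 4.0), (3.0, 6.0), (1.0, 2.0)]
weights = [0.0, 0.0]
for _ in range(200):
    weights = federated_round(weights, [hospital_a, hospital_b])
print([round(w, 3) for w in weights])
```

Real deployments add secure communication, handling of sites with very different data distributions, and often additional privacy safeguards, none of which are shown here.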

Another key consideration in the medical sciences is how AI tools are used. The studies presented in^61,62 address a significant concern, emphasizing that the ethical application of these tools in the medical sciences should aim to improve well-being and reduce suffering. As discussed earlier, numerous factors influence the data preparation process and its quality during training. As a result, alongside reliability and accuracy, the prediction confidence level of AI algorithms must be evaluated for image analysis. Uncertainty estimation holds particular significance given the imperfect nature of data preparation. Healthcare professionals should be notified of high uncertainty values so that they can take this information into account in their final decisions. We are hopeful that this emerging research area will increase the applicability of AI in real-world scenarios by improving the reliability of methods that are currently seen as black boxes.
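One simple way to surface high-uncertainty predictions is to compute the entropy of a classifier's output probabilities and flag cases above a threshold for review. Predictive entropy is only one of several uncertainty measures and is not singled out by this study; the function names and the 0.5 threshold below are illustrative assumptions.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must be non-negative and sum to 1")
    # Terms with p == 0 contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy divided by its maximum, log(number of classes), giving [0, 1]."""
    if len(probs) < 2:
        raise ValueError("at least two classes are required")
    return predictive_entropy(probs) / math.log(len(probs))


def needs_review(probs, threshold=0.5):
    """Flag a prediction whose normalized entropy exceeds the threshold."""
    return normalized_entropy(probs) > threshold


for probs in ([0.99, 0.01], [0.55, 0.45]):
    print(probs, round(normalized_entropy(probs), 3), needs_review(probs))
```

The threshold would need to be calibrated on validation data for a given model and task, since raw softmax probabilities are often overconfident.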

Conclusion

In this paper, we presented an in-depth examination of the publicly accessible datasets in the field of medical imaging. This work also determined the methods required for preparing medical imaging data for the development of AI algorithms and emphasized current limitations in dataset curation. This preparation phase must be completed before starting the design or deployment of any AI algorithm. Furthermore, this study explored inventive strategies to address the challenge of data availability and offered a detailed overview of data curation tools. This organized account gives researchers and clinicians an in-depth guide to selecting from the many currently accessible tools for preparing clinical images prior to implementing AI methods. The paper concluded with a discussion and analysis of challenges in medical image analysis, along with potential future directions in the field.

In addition to the primary focus of this work, which was data preparation, it is imperative to anticipate emerging trends in AI for clinical imaging, such as data augmentation, ethical issues surrounding AI, federated learning, and uncertainty estimation.

Footnotes

Acknowledgments

This study is supported via funding from Prince Sattam bin Abdulaziz University, Alkharj, Saudi Arabia.

ORCID iD

Sajid Ullah Khan

Authors contribution

Abdulrahman Alabduljabbar (Data curation; Writing – original draft; Writing – review & editing); Sajid Khan (Conceptualization; Data curation; Writing – original draft); Anas Alsuhaibani (Formal analysis; Investigation; Project administration; Resources; Writing – review & editing); Fahdah Almarshad (Data curation; Formal analysis; Methodology; Validation); Youssef N Altherwy (Formal analysis; Funding acquisition; Project administration; Software; Supervision; Validation; Visualization).

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Thanks to Prince Sattam University for funding via project number (PSAU/2024/01/29710).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

Data may be provided upon request to the corresponding author.

References

1. Patel, Shortliffe, Stefanelli, et al. The coming of age of artificial intelligence in medicine. Artif Intell Med 2009; 46: 5–17.

2. Giger. Machine learning in medical imaging. J Am Coll Radiol 2018; 15: 512–520.

3. Savadjiev, Chong, Dohan, et al. Demystification of AI-driven medical image interpretation: past, present and future. Eur Radiol 2019; 29: 1616–1624.

4. Wang, Jin, Yan, et al. AI-assisted CT imaging analysis for COVID-19 screening: building and deploying a medical AI system. Appl Soft Comput 2021; 98: 106897.

5. Shi, Miao, Schoepf, et al. A clinically applicable deep-learning model for detecting intracranial aneurysm in computed tomography angiography images. Nat Commun 2020; 11: 6090.

6. Zhu, Hua, et al. A systematic collection of medical image datasets for deep learning. ACM Comput Surv 2023; 56: 1–51.

7. Mao, Chen, Huo, et al. Altered resting-state functional connectivity and effective connectivity of the habenula in irritable bowel syndrome: a cross-sectional and machine learning study. Hum Brain Mapp 2020; 41: 3655–3666.

8. Liu, Logan, et al. Learning the dynamic treatment regimes from medical registry data through deep Q-network. Sci Rep 2019; 9: 1495.

9. Simpson, Antonelli, Bakas, et al. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063, 2019.

10. Tajbakhsh, Jeyaseelan, et al. Embracing imperfect datasets: a review of deep learning solutions for medical image segmentation. Med Image Anal 2020; 63: 101693.

11. Willemink, Koszek, Hardell, et al. Preparing medical imaging data for machine learning. Radiology 2020; 295: 4–15.

12. de Araujo, Hardell, Koszek, et al. Data preparation for artificial intelligence. In: De Cecco, van Assen, Leiner (eds) Artificial intelligence in cardiothoracic imaging. Contemporary medical imaging. Cham: Humana, 2022, pp.37–43.

13. Denner, Scherer, Kades, et al. Efficient large scale medical image dataset preparation for machine learning applications. In: Bhattarai B, Ali S, Rau A, et al. (eds) Data Engineering in Medical Imaging. DEMI 2023. Lecture Notes in Computer Science, vol 14314. Cham: Springer, 2023, pp.46–55.

14. The Mind Research Network, http://mrn.org/micis (1998, accessed 23 December 2023).

15. Image & Data Archive, https://ida.loni.usc.edu/login.jsp (2023, accessed 24 December 2023).

16. Pilot European Image Processing Archive, http://peipa.essex.ac.uk/info/mias.html (2012, accessed 24 December 2023).

17. Cancer Imaging Program (CIP), https://www.cancerimagingarchive.net (accessed 24 December 2023).

18. The Open Access Series of Imaging Studies, https://www.oasis-brains.org/ (2007, accessed 25 December 2023).

19. Bio-Image Informatics, https://bioimage.ucsb.edu/research/biosegmentation (2018, accessed 03 February 2024).

20. Introduction - Grand Challenge, http://grand-challenge.org (2012, accessed 04 February 2024).

21. ADNI, Alzheimer's Disease Neuroimaging Initiative, https://usc.edu (2022, accessed 26 December 2023).

22. The Whole Brain Atlas, https://www.med.harvard.edu/aanlib/ (2004, accessed 26 December 2023).

23. Ranftl, Lasinger, Hafner, et al. Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE Trans Pattern Anal Mach Intell 2020; 44: 1623–1637.

24. Breast Cancer Digital Repository (BCDR), https://bcdr.eu/ (2012, accessed 23 December 2023).

25. The National Library of Medicine's Visible Human Project, https://www.nlm.nih.gov/research/visible/visible_human.html (accessed 02 January 2024).

26. Digital Slide Archive DSA, https://digitalslidearchive.github.io/digital_slide_archive/ (accessed 03 February 2024).

27. Computer Vision and Pattern Recognition Group, http://www.eng.usf.edu/cvprg/ (2018, accessed 03 February 2024).

28. National Biomedical Imaging Archive, https://wiki.nci.nih.gov/display/NBIA/NBIA+CBIIT+Instance+Retirement+Notice (2022, accessed 04 February 2024).

29. UK Biobank, https://www.ukbiobank.ac.uk/ (2007, accessed 04 February 2024).

30. Kaggle, https://www.kaggle.com/ (2010, accessed 06 February 2024).

31. Medical Segmentation Decathlon, http://medicaldecathlon.com/ (2022, accessed 06 February 2024).

32. Open Neuro, https://openneuro.org/ (2011, accessed 06 February 2024).

33. NeuroImaging Tools & Resources Collaboratory, https://www.nitrc.org/ (2006, accessed 07 February 2024).

34. Federal Interagency Traumatic Brain Injury Research, https://fitbir.nih.gov/ (accessed 08 February 2024).

35. CQ500 Dataset, http://headctstudy.qure.ai/dataset (2017, accessed 08 February 2024).

36. The National Institute of Mental Health Data Archive (NDA), https://nda.nih.gov/ (2022, accessed 08 February 2024).

37. The Connectome Coordination Facility (CCF), https://www.humanconnectome.org/ (2011, accessed 08 February 2024).

38. Optimum Mammography Imaging, https://medphys.royalsurrey.nhs.uk/omidb/ (2008, accessed 09 February 2024).

39. Cardiac Atlas Project (CAP), https://www.cardiacatlas.org/ (accessed 10 February 2024).

40. VIA Group Public Databases, https://www.via.cornell.edu/databases/ (2019, accessed 10 February 2024).

41. Valencia Region Image Bank (BIMCV), https://github.com/BIMCV-CSUSP/BIMCV-COVID-19 (accessed 10 February 2024).

42. Retinal Image Datasets, https://blogs.kingston.ac.uk/retinal/ (2023, accessed 10 February 2024).

43. Illuminate InSight, https://goilluminate.com/solution/insight/ (2019, accessed 12 September 2022).

44. mPower Clinical Analytics for medical Imaging, https://www.nuance.com/healthcare/diagnostics-solutions/radiology-performance-analytics/mpower-clinical-analytics.html (accessed 13 February 2024).

45. Huang. PACS and imaging informatics: basic principles and applications. Switzerland: John Wiley & Sons, 2011.

46. Bick, Lenzen. PACS: the silent revolution. Eur Radiol 1999; 9: 1152–1160.

47. Pinto dos Santos, Giese, Brodehl, et al. Medical students’ attitude towards artificial intelligence: a multicentre survey. Eur Radiol 2019; 29: 1640–1646.

48. Shorten, Khoshgoftaar. A survey on image data augmentation for deep learning. J Big Data 2019; 6: 1–48.

49. Williams, Krupinski, Strauss, et al. Digital radiography image quality: image acquisition. J Am Coll Radiol 2007; 4: 371–388.

50. Nugroho, Hidayat, Nugroho. Artifact removal in radiological ultrasound images using selective and adaptive median filter. In: Proceedings of the 3rd International Conference on Cryptography, Security and Privacy, 2019, pp.237–241.

51. Khan, Ullah, et al. A review of airport dual energy X-ray baggage inspection techniques: image enhancement and noise reduction. J X-Ray Sci Technol 2020; 28: 481–505.

52. Galbusera, Cina. Image annotation and curation in radiology: an overview for machine learning practitioners. Eur Radiol Exp 2024; 8: 11.

53. Cheng, Chen, et al. Light-guided and cross-fusion U-net for anti-illumination image super-resolution. IEEE Trans Circ Syst Video Tech 2022; 32: 8436–8449.

54. Aryanto, Oudkerk, van Ooijen. Free DICOM de-identification tools in clinical research: functioning and safety of patient privacy. Eur Radiol 2015; 25: 3685–3695.

55. Rebinth, Kumar. Importance of manual image annotation tools and free datasets for medical research. J Adv Res Dyn Control Syst 2019; 10: 1880–1885.

56. Hanbury. A survey of methods for image annotation. J Vis Lang Comput 2008; 19: 617–627.

57. Open Health Imaging Foundation, https://ohif.org/ (2015, accessed 11 February 2024).

58. Diaz, Guidi, Ivashchenko, et al. Artificial intelligence in the medical physics community: an international survey. Phys Med 2021; 81: 141–146.

59. Goodfellow, Pouget-Abadie, Mirza, et al. Generative adversarial nets. Proceedings of the 27th International Conference on Neural Information Processing Systems 2014; 2: 2672–2680.

60. McMahan, Moore, Ramage, et al. Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, 20–22 April 2017, pp.1273–1282.

61. Geis, Brady, et al. Ethics of artificial intelligence in radiology: summary of the joint European and North American multisociety statement. Can Assoc Radiol J 2019; 70: 329–334.

62. Goyal, Apolo, Berman, et al. ENABLE (Exportable notation and bookmark list engine): an interface to manage tumor measurement data from PACS to cancer databases. J Digit Imaging 2017; 30: 275–286.

Medical imaging datasets, preparation, and availability for artificial intelligence in medical imaging
