Abstract
In recent years, the transformer model (the neural network architecture behind ChatGPT) has garnered attention for its advantages over convolution-based neural networks (CNNs) in natural language processing (NLP) tasks. Its success is attributed to its ability to learn long-range dependencies and spatial correlations that retain contextual information. As an extension, transformers were applied to images (called vision transformers, or ViTs for short) and were shown to achieve impressive results compared with CNNs on various computer vision tasks. Deep learning (DL) networks, mostly based on CNNs, have been used for various medical image analysis tasks. Researchers have studied ViTs for medical image analysis and found that their results are comparable to, and sometimes exceed, those of CNNs. However, there were a few areas where ViTs performed worse than CNNs. This review article analyzes whether ViTs have the potential to replace the current state-of-the-art CNNs in various medical imaging tasks in oncology. We discuss cancer studies that use CNNs and ViTs, individually and in hybrid forms, along with their performance, merits, and demerits. We also propose potential solutions to the drawbacks found in ViTs and outline future research directions. To keep the analysis focused, this review considers four cancer types: skin cancer, lung cancer, breast cancer, and prostate cancer.
