
(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 15, No. 7, 2024

Recent Advances in Medical Image Classification


Loan Dao, Ngoc Quoc Ly
Dept. of Computer Vision and Cognitive Cybernetics, University of Science,
Vietnam National University Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam

Abstract—Medical image classification is crucial for diagnosis and treatment, benefiting significantly from advancements in artificial intelligence. The paper reviews recent progress in the field, focusing on three levels of solutions: basic, specific, and applied. It highlights advances in traditional methods using deep learning models like Convolutional Neural Networks and Vision Transformers, as well as state-of-the-art approaches with Vision-Language Models. These models tackle the issue of limited labeled data, and enhance and explain predictive results through Explainable Artificial Intelligence.

Keywords—Medical Image Classification (MIC); Artificial Intelligence (AI); Vision Transformer (ViT); Vision-Language Model (VLM); eXplainable AI (XAI)

I. INTRODUCTION

Medical Image Classification (MIC), a crucial integration of Artificial Intelligence (AI) and Computer Vision (CV), is revolutionizing image-based disease diagnosis. By categorizing medical images into specific disease classes, MIC enhances diagnostic accuracy and efficiency. Utilizing various imaging modalities like X-rays, CT scans, MRI, and ultrasound, MIC systems cater to specific clinical needs. Incorporating state-of-the-art technologies, MIC optimizes classification accuracy, leading to precise diagnoses and improved patient care.

1) The importance of MIC: The ability to interpret medical images accurately and efficiently is crucial for timely and effective patient care. However, manual image analysis can be time-consuming and prone to human error. MIC, leveraging AI and CV, offers automated analysis and classification of medical images, leading to several benefits:

a) Improved diagnostic accuracy: MIC systems can detect subtle patterns and features at the pixel level that may be missed by human observers, leading to more accurate diagnoses.

b) Reduced workload for physicians: Automating image analysis frees up valuable time for physicians, allowing them to focus on patient interaction and complex decision-making.

c) Enhanced efficiency: MIC systems can process large volumes of images quickly, leading to faster diagnoses and treatment decisions.

d) Improved patient outcomes: Ultimately, the improved accuracy and efficiency of MIC contribute to better patient outcomes and overall healthcare quality.

2) Challenges and the need for transparency: While MIC offers immense potential, challenges remain. Hospital overload, physician burnout, and the risk of misdiagnosis necessitate robust and reliable MIC systems. Transparency and explainability are crucial for building trust among stakeholders. Explainable AI (XAI) addresses this need by providing insights into the decision-making process of MIC models, allowing physicians to understand the rationale behind classifications and make informed decisions.

3) Advancements in MIC: Recent advancements in MIC have significantly enhanced its capabilities. Large-scale Medical Vision-Language Models (Med-VLMs) trained on extensive datasets of image-caption pairs enable a deeper understanding of visual information, leading to more accurate and generalizable models. Additionally, novel network architectures like transformers and multi-task learning approaches have further improved performance and efficiency. Few-shot and zero-shot learning have also made significant contributions to MIC. Few-shot learning allows models to classify images with minimal labeled examples, beneficial in fields where obtaining large labeled datasets is challenging. Zero-shot learning enables models to classify images from unseen classes by leveraging knowledge transfer from related tasks. Combined with Explainable AI (XAI) techniques, these approaches not only explain results and increase model reliability but also optimize outcomes, enhancing system accuracy and performance. This comprehensive understanding and improved reliability facilitate their integration into clinical practice with high confidence and precision, ultimately leading to better patient outcomes and more efficient healthcare processes.

4) Exploring MIC across three levels of solution: To fully grasp the current state of MIC, this paper delves into three distinct levels:

a) Level 1: Basic Models: This level examines the fundamental theoretical models underlying MIC, including learning models, basic network architectures, and XAI techniques.

b) Level 2: Task-Specific Models: This level explores specific theoretical models and network architectures tailored to particular MIC tasks, such as single-task and multi-task classification.

c) Level 3: Applications: This level surveys prominent applications of MIC within the medical community, highlighting recent research trends and real-world implementations.

5) Contributions and structure: This article makes several key contributions:


a) Comprehensive review: It provides a thorough and systematic review of recent advancements in MIC, offering valuable insights for researchers and practitioners.

b) Highlighting key developments: It identifies and discusses significant breakthroughs, including VLMs, transformer-based architectures, multitask models, and progress in XAI, which not only explain prediction results but also enhance the performance of MIC. Notably, recent advancements in zero-shot learning and few-shot learning address data scarcity in the medical field and mitigate model overfitting.

c) Addressing challenges and proposing solutions: It explores challenges in MIC and proposes effective solutions to improve classification algorithms and systems.

d) Exploring current issues: It delves into pressing problems surrounding recent advancements in MIC, providing a deeper understanding of the evolving research landscape.

The remainder of the paper is structured as follows (Fig. 1): Section II overviews recent advancements across three levels. Sections III to V detail each level. Section VI addresses challenges and proposes solutions. Section VII concludes and highlights future research directions. TABLE I lists the abbreviations used.

By comprehensively exploring recent advancements in MIC, this article aims to contribute to the development of more effective and reliable classification systems, ultimately improving patient care and outcomes.

This comprehensive survey demonstrates the multi-faceted nature of medical image classification across various levels of solutions, providing researchers and practitioners with a holistic view of the field's current state and future directions. By synthesizing recent advancements in MIC across fundamental models, task-specific architectures, and real-world applications, this article not only addresses current challenges but also contributes significantly to the ongoing research in the field, offering valuable insights for future developments.

The proposed methods in this survey address key challenges in MIC:

• Med-VLMs leverage visual and textual data to mitigate limited labeled data issues, enhancing model robustness and generalizability.

• Few-shot and zero-shot learning techniques enable classification of rare conditions with minimal training examples.

• Transformer-based architectures and CNN hybrids capture both local and global features, improving complex medical image comprehension.

• XAI integration enhances interpretability, fostering trust and adoption in clinical settings.

These approaches represent targeted solutions to specific MIC challenges, demonstrating the field's adaptability to clinical needs. By addressing data scarcity, rare condition classification, feature extraction, and interpretability, these methods contribute to advancing AI-driven medical image analysis.

Fig. 1. Overview of paper organization.

TABLE I. LIST OF COMMON ABBREVIATIONS

AI: Artificial Intelligence
CAD: Computer-Aided Diagnosis
CNN: Convolutional Neural Network
CV: Computer Vision
DL: Deep Learning
DNN: Deep Neural Network
FSL: Few-Shot Learning
Med-VLM: Medical Vision-Language Model
MIC: Medical Image Classification
MTL: Multitask Learning
ML: Machine Learning
NLP: Natural Language Processing
SOTA: State-of-the-Art
VLM: Vision-Language Model
XAI: eXplainable Artificial Intelligence
ZSL: Zero-Shot Learning

II. OVERVIEW OF RECENT ADVANCES IN MIC ACROSS THREE LEVELS OF CLASSIFICATION SYSTEMS

This section explores the evolving landscape of MIC through a standard three-level classification system framework. Each level serves a distinct purpose, building upon the foundations of the preceding one. TABLE II provides a comprehensive overview of recent advancements in MIC across these three levels, highlighting their functionalities and advantages. This structured approach facilitates a deeper understanding of the current SOTA and the interconnected nature of progress within the field.


III. LEVEL 1 OF MIC (FUNDAMENTAL MODELS)

Level 1 includes learning models, fundamental network architectures and backbone DNNs, and XAI. This level plays an essential role in developing systems at the subsequent levels.

A. Learning Model

1) Unimodal learning in MIC: The evolution of learning models has significantly impacted the field of MIC, offering solutions to challenges like manual data labeling and limited generalization capacity. TABLE III provides a concise comparison of various unimodal learning models commonly employed in MIC, highlighting their key characteristics and suitability for different scenarios. Selecting an optimal learning model for MIC tasks (see TABLE III) requires careful consideration of data availability, labeling costs, privacy requirements, and performance expectations. While supervised learning is powerful when labeled data is abundant, data annotation limitations and privacy concerns necessitate exploring alternative paradigms. Semi-supervised, weakly-supervised, active learning, meta-learning, federated learning, and self-supervised learning offer promising avenues to address these challenges, fostering the development of more efficient and generalizable MIC systems. Leveraging these diverse approaches allows researchers and practitioners to unlock the full potential of MIC, ultimately leading to improved patient care and clinical outcomes.

2) Multimodal learning with Med-VLMs in MIC: Bridging the semantic gap between visual and textual information is crucial for effective MIC. VLMs integrate Computer Vision and Natural Language Processing, enabling a comprehensive understanding of medical data. This section explores the role of clinical and paraclinical data in Medical Vision-Language Models (Med-VLMs) and surveys SOTA Med-VLMs for MIC.

a) Clinical and paraclinical data in Med-VLMs: To better understand the distinct roles and characteristics of clinical and paraclinical data within Med-VLMs, TABLE IV provides a comparative analysis. Clinical data provides valuable context for interpreting paraclinical images, while paraclinical data offers objective visualizations of potential abnormalities. Med-VLMs leverage both data types to enhance diagnostic accuracy and provide a holistic understanding of patient health.

b) State-of-the-Art (SOTA) Med-VLMs in MIC: Several advanced Med-VLMs have demonstrated remarkable performance in MIC tasks, utilizing sophisticated techniques such as transformer architectures, attention mechanisms, and pre-training on large datasets. TABLE V summarizes SOTA Med-VLMs for MIC.
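To make the image-text pretraining idea behind these Med-VLMs concrete, the following sketch shows a minimal CLIP-style symmetric contrastive loss in PyTorch. This is a generic illustration only: the embedding dimension, the temperature value, and the random tensors standing in for encoder outputs are assumptions, not the actual components of any surveyed model.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the vision and language
    encoders for matching image-report pairs. Pair i is the positive for
    row/column i; all other pairs in the batch serve as negatives.
    """
    # Project both modalities onto the unit sphere so the dot product
    # becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random embeddings stand in for real encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt))
```

Pulling matched image-report pairs together while pushing apart mismatched pairs is the shared training signal behind contrastive Med-VLMs such as those in TABLE V; the models differ mainly in their encoders, fusion strategies, and auxiliary objectives.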

TABLE II. OVERVIEW OF THE THREE-LEVEL SOLUTION FRAMEWORK FOR MEDICAL IMAGE CLASSIFICATION

Level 1 — Learning model
 Specific solutions:
 • Unimodal learning: supervised learning, unsupervised learning, semi-supervised learning, weakly supervised learning, active learning, meta-learning, federated learning, self-supervised learning.
 • Med-VLMs: BiomedCLIP [1], XrayGPT [2], M-FLAG [3], and MedBLIP [4].
 • Some remarkable methods:
  o Few-shot learning: BioViL-T [5], PM2 [6], and DeViDe [7].
  o Zero-shot learning: MedCLIP [8], CheXZero [9], and MedKLIP [10].
 Explanation: The evolution of learning models from unimodal to multimodal, exemplified by the emergence of Med-VLMs, represents a significant advancement in the field. Few-shot and zero-shot learning models further enhance the ability to classify medical images with minimal or no labeled data, making them effective for rare and novel diseases.

Level 1 — Architectures of fundamental networks and backbone DNN
 Specific solutions:
 • CNN: VGGNet [11], GoogleNet [12], ResNet [13], and EfficientNet [14].
 • GNN: Graph Convolutional Networks (GCN) [15] and GAT [16].
 • Transformer: ViT [17], DeiT [18], TransUnet [19], TransUnet+ [20], and TransUnet++ [21].
 Explanation: Evolution of fundamental network architectures in image classification, including CNNs, GNNs, and Vision Transformers, as well as their respective backbone DNNs.

Level 1 — XAI
 Specific solutions:
 • For CNN: LIME [22], SHAP [23], CAM-based (CAM [24], GradCAM [25], and GradCAM++ [26]).
 • For Transformer: ProtoPFormer [27], X-Pruner [28], and GradCAM for Vision Transformer [29].
 Explanation: XAI applied to CNN architectures and Vision Transformers.

Level 2 — Specific DNN architectures and Med-VLMs for single task (classification)
 Specific solutions:
 • CNN: Unet [30], Unet++ [31], Snapshot Ensemble [32], and PTRN [33].
 • GNN: CCF-GNN [34] and GazeGNN [35].
 • Transformer: SEViT [36] and MedViT [37].
 • Med-VLM: BERTHop [38], KAD [39], CLIPath [40], and ConVIRT [41].
 Explanation: Specialized network architectures have achieved high performance in MIC; Med-VLMs for MIC.

Level 2 — Specific DNN architectures and Med-VLMs for multitask (classification and segmentation)
 Specific solutions:
 • CNN: Mask-RCNN-X101 [42] and Cerberus [43].
 • GNN: MNC-Net [44] and AMTI-GCN [45].
 • Transformer: TransMT-Net [46] and CNN-VisT-MLP-Mixer [47].
 • Med-VLM: GLoRIA [48], ASG [49], MeDSLIP [50], SAT [51], CONCH [52], and ECAMP [53].
 Explanation: MIC is advancing with multitasking. Classifying disease segments often excels over whole-image analysis. The advent of Med-VLMs for multitasking enhances the precision and depth of analyses.

Level 3 — Specific applications
 Specific solutions:
 • Breast cancer [54][55], tuberculosis [56], eye disease diagnosis [57][58], skin cancer diagnosis [59][60], bone disease [61]-[63], and other pathologies [64][65].
 • Cancer, brain, tumor, lesion, lung, breast, eye, etc.
 Explanation: Surveying prominent applications significant to the medical community; recent research trends in MIC (2020-2024) and cancer statistics for 2024.


TABLE III. COMPARISON OF LEARNING MODELS IN MIC

Supervised Learning
 Data availability: Labeled data required. Labeling cost: High.
 Operating principle: Learns input-output mapping from labeled data.
 Balance: High performance with sufficient labeled data.
 Applications: Tumor detection, organ segmentation, classification.

Unsupervised Learning
 Data availability: Unlabeled data only. Labeling cost: Low.
 Operating principle: Finds patterns and structures in data without explicit supervision.
 Balance: Lower performance, useful for discovering underlying structures.
 Applications: Clustering similar images, anomaly detection.

Semi-supervised Learning
 Data availability: Labeled and unlabeled data. Labeling cost: Moderate.
 Operating principle: Utilizes a combination of labeled and unlabeled data to improve model performance.
 Balance: Higher performance than unsupervised learning with less labeling effort.
 Applications: Classification with limited labeled data.

Weakly Supervised Learning
 Data availability: Weak supervision (coarse or image-level labels). Labeling cost: Moderate to low.
 Operating principle: Learns from partially labeled or noisy data.
 Balance: Scalability with reasonable performance.
 Applications: Image-level diagnosis tasks.

Self-supervised Learning
 Data availability: Unlabeled data. Labeling cost: Low.
 Operating principle: Generates supervisory signals from the input data itself.
 Balance: Balances model performance with labeling effort by leveraging unlabeled data.
 Applications: Efficient use of unlabeled data to pre-train models for downstream tasks.

Active Learning
 Data availability: Small initial labeled dataset; actively selects informative samples. Labeling cost: Initially high, decreases over time.
 Operating principle: Actively selects the most informative samples to be labeled.
 Balance: Balances model performance with labeling effort.
 Applications: Reducing labeling effort by prioritizing informative images.

Meta-Learning
 Data availability: Diverse set of tasks for meta-training. Labeling cost: High initially, potentially low for downstream tasks.
 Operating principle: Learns to learn from different tasks, improving adaptation to new tasks with limited data.
 Balance: Balances adaptation to new tasks with reduced need for extensive labeled data.
 Applications: Efficient adaptation to new imaging modalities or diseases.

Federated Learning
 Data availability: Decentralized data across multiple devices/institutions. Labeling cost: Varies depending on data distribution.
 Operating principle: Collaboratively trains a global model while keeping data localized.
 Balance: Balances model performance with data privacy and availability.
 Applications: Collaborative model training across institutions without sharing sensitive data.
In summary, Med-VLMs show significant potential for advancing MIC by effectively integrating clinical and paraclinical data. Key takeaways from the surveyed models include the effectiveness of transfer learning, model optimization techniques, integration of medical knowledge, and the development of multi-task models. These advancements pave the way for more accurate, efficient, and comprehensive diagnostic support tools in healthcare.

3) Some remarkable methods

a) Few-shot learning in MIC: In the medical imaging domain, few-shot learning (FSL) techniques are crucial due to the scarcity of labeled data and the dynamic nature of disease patterns. FSL enables accurate classification and diagnosis from a limited number of training samples, leveraging meta-learning and transfer-learning principles.

Core Principles:

Meta-learning: Models are trained on diverse medical imaging tasks to learn a shared representation that can be quickly adapted to new tasks with few examples, optimizing for rapid adaptation to new data.

Transfer learning: Pre-trained models on large medical datasets are fine-tuned on smaller, specific datasets to improve performance on the target task, such as disease classification or anomaly detection.
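Before turning to concrete models, a minimal sketch can make the few-shot principle tangible. The code below implements prototypical-network-style classification on top of a frozen feature encoder: each class prototype is the mean embedding of its few labeled support images, and a query image is assigned to the nearest prototype. The encoder is a stand-in (random embeddings here); this illustrates the general technique, not any of the surveyed models.

```python
import torch

def prototype_classify(support_emb, support_labels, query_emb, num_classes):
    """Few-shot classification by nearest class prototype.

    support_emb: (n_support, dim) embeddings of the few labeled examples.
    support_labels: (n_support,) integer class ids in [0, num_classes).
    query_emb: (n_query, dim) embeddings of images to classify.
    Returns (n_query,) predicted class ids.
    """
    # One prototype per class: the mean embedding of its support examples.
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])  # (num_classes, dim)

    # Euclidean distance to every prototype; nearest prototype wins.
    dists = torch.cdist(query_emb, prototypes)  # (n_query, num_classes)
    return dists.argmin(dim=1)

# Toy 3-way 5-shot episode with random embeddings standing in for a
# pretrained encoder's output.
support = torch.randn(15, 128)
labels = torch.arange(3).repeat_interleave(5)  # [0,0,0,0,0,1,...,2]
queries = torch.randn(4, 128)
print(prototype_classify(support, labels, queries, num_classes=3))
```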


Relevant Med-VLM Models:

• BioViL-T [5] is a self-supervised learning approach that leverages temporal information within longitudinal medical reports and images to enhance performance on medical vision-language tasks. It utilizes a hybrid CNN-Transformer architecture for encoding visual data and a text model pretrained with contrastive and masked language modeling objectives. This approach enables BioViL-T to learn robust representations of medical concepts by capturing both visual and temporal relationships present in longitudinal data. The model's strength lies in its ability to transfer knowledge from diverse sources, leading to improved performance in few-shot settings.

• PM2 [6] introduces a novel multi-modal prompting paradigm for few-shot medical image classification. Its key strength lies in leveraging a pre-trained CLIP model and learnable prompt vectors to effectively bridge visual and textual modalities. This approach enables PM2 to achieve impressive performance in few-shot settings, surpassing existing methods on various medical image classification benchmarks.

• DeViDe [7] is a novel transformer-based approach that leverages open radiology image descriptions to align diverse medical knowledge sources, handling the complexity of associating images with multiple descriptions in multi-label scenarios. It guides medical image-language pretraining using structured medical knowledge, enabling more meaningful image and language representations for improved performance in downstream tasks like medical image classification and captioning.

Advantages:

• Data Efficiency: Reduces the need for large amounts of labeled data, making it feasible to develop models with limited resources.

• Flexibility: Can quickly adapt to new tasks with minimal data, which is crucial in dynamic environments like medical imaging.

Disadvantages:

• Performance: May be less effective compared to models trained on large, fully labeled datasets.

• Complexity: Requires careful design of task sets for training to ensure generalizability and robustness.

b) Zero-shot learning in MIC: Zero-shot learning (ZSL) enables the classification of unseen classes by leveraging semantic relationships between known and unknown classes. ZSL's core principle is to use auxiliary information, such as textual descriptions, to bridge the gap between seen and unseen classes, thereby expanding AI systems' diagnostic capabilities.

Core Principles (a code sketch follows this list):

• Semantic Embeddings: Align visual features with semantic representations (e.g., word embeddings) to infer the class of unseen instances by creating a shared space where both visual and semantic data coexist.

• Knowledge Transfer: Utilize knowledge from known classes to predict the properties of unknown classes based on their semantic descriptions, effectively transferring learned information across domains.
Neural Networks (GNNs), and Transformers. These
 MedCLIP [8] uses contrastive learning from unpaired architectures have shown remarkable efficacy in automatically
medical image-text data to improve representation learning hierarchical feature representations and achieving state-
learning and zero-shot prediction, achieving strong of-the-art performance in various MIC tasks.
performance even with limited data.
1) Convolutional Neural Networks (CNNs): CNNs have
 CheXZero [9] is a deep learning model specifically for become the cornerstone of MIC due to their ability to
chest X-ray classification, utilizing pre-trained CNNs automatically learn hierarchical feature representations.
and fine-tuning on labelled data to achieve high accuracy Inspired by the human visual cortex (Fig. 2 [66]), CNNs excel
in identifying thoracic diseases. at capturing local features within images, making them ideal for
 MedKLIP [10] leverages medical knowledge during tasks like disease detection, organ segmentation, and anomaly
language-image pre-training in radiology, enhancing its identification. This section explores the core components of
ability to handle unseen diseases in zero-shot tasks and CNNs and their contributions to feature extraction and
maintaining strong performance even after fine-tuning. classification, followed by a review of popular CNN
These models represent significant advancements in medical architectures and their advancements in MIC.
image classification, demonstrating impressive results and a) Core components of CNNs: TABLE VI. summarizes
addressing the unique challenges posed by healthcare data. the core components of a CNN and their functions in feature
Advantages: extraction and class prediction.
These components work synergistically to enable CNNs to
 Scalability: Enables classification of novel classes learn intricate features from medical images, leading to accurate
without prior training examples, making it highly classification.
scalable and versatile.
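Putting the components of TABLE VI together, a minimal CNN classifier might look like the PyTorch sketch below. The layer sizes, input resolution, and two-class output are arbitrary illustrations for a grayscale scan, not a surveyed architecture.

```python
import torch
import torch.nn as nn

class TinyMedCNN(nn.Module):
    """Minimal CNN mirroring TABLE VI: convolution -> ReLU -> pooling,
    repeated, then a fully-connected layer and softmax over classes."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # local feature extraction
            nn.ReLU(),                                   # non-linearity
            nn.MaxPool2d(2),                             # down-sampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Fully-connected layer integrates local features into a global pattern.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):                  # x: (batch, 1, 64, 64) grayscale scans
        h = self.features(x).flatten(1)
        logits = self.classifier(h)
        return logits.softmax(dim=1)       # softmax layer: class probabilities

model = TinyMedCNN()
print(model(torch.randn(4, 1, 64, 64)).shape)  # -> torch.Size([4, 2])
```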


TABLE IV. COMPARISON OF CLINICAL AND PARACLINICAL DATA

Source
 Clinical data: Direct interaction with healthcare professionals.
 Paraclinical data: Imaging procedures (e.g., X-ray, CT, MRI, ultrasound).

Nature
 Clinical data: Text-based (medical history, symptoms, physical exam findings).
 Paraclinical data: Image-based (internal body structures).

Role
 Clinical data: Subjective assessment of patient condition.
 Paraclinical data: Objective visualization of abnormalities.

Usage in VLMs
 Clinical data: Provides context and complements image interpretation.
 Paraclinical data: Serves as primary input for image analysis and classification.

TABLE V. PROMINENT MED-VLMS IN MIC

BiomedCLIP [1]
 Principle: Adapts CLIP for biomedical domains.
 Encoders and fusion method: Language encoder: PubMedBERT; Vision encoder: ViT; Fusion method: late fusion.
 Pre-training objective: Cross-modal global contrastive learning.
 Implementation details: Tailored batch size and patch-dropout strategy for efficiency.
 Performance metrics: PCam: 73.41; LC25000 (lung): 65.23; LC25000 (colon): 92.98; TCGA-TIL: 67.04; RSNA: 78.95.
 Key contributions: Superior zero-shot and few-shot classification; outperforms SOTA models on diverse biomedical datasets; robust image encoder.

XrayGPT [2]
 Principle: Summarizes chest X-rays by aligning MedCLIP with Vicuna.
 Encoders and fusion method: Language encoder: Vicuna; Vision encoder: MedCLIP; Fusion method: early fusion.
 Pre-training objective: Hybrid: image-report matching and mixed objectives.
 Implementation details: Fine-tuned Vicuna on curated reports.
 Performance metrics: Interactive summaries from radiology reports.
 Key contributions: Integration of medical knowledge through interactive summaries, enhancing the interpretability and usability of diagnostic results.

M-FLAG [3]
 Principle: Frozen language model with an orthogonality loss for a harmonized latent space.
 Encoders and fusion method: Language encoder: CXR-BERT (frozen); Vision encoder: ResNet50; Fusion method: late fusion.
 Pre-training objective: Hybrid: image-text contrastive learning and language generation.
 Implementation details: Potential for classification, segmentation, and object detection.
 Performance metrics: Outperforms existing MedVLP approaches with a 78% parameter reduction.
 Key contributions: Model optimization and efficiency, achieving high performance with reduced parameters.

MedBLIP [4]
 Principle: Bootstraps VLP from 3D medical images and texts.
 Encoders and fusion method: Language encoder: BioMedLM; Vision encoder: ViT-G14 (EVA-CLIP); Fusion method: late fusion.
 Pre-training objective: Global and local contrastive learning.
 Implementation details: Combines pre-trained vision and language models.
 Performance metrics: SOTA zero-shot classification of Alzheimer's disease.
 Key contributions: Efficient 3D medical image processing facilitates classifying complex conditions with minimal labeled data.

b) Popular CNN architectures — a historical perspective: The evolution of CNN architectures has been driven by continuous innovation in addressing challenges and improving performance. TABLE VII highlights key milestones.

CNN architectures offer unique advantages and have demonstrably excelled in image classification tasks. Their capacity to learn intricate features and generalize to new data underscores their value in advancing image analysis and related research fields. Ongoing research promises further innovations in CNN architecture and training methodologies, leading to increasingly accurate and efficient image classification systems. This progress holds particular significance for the medical domain, where precise image classification can directly impact diagnosis and patient care.

Fig. 2. Illustration of convolutional neural networks (CNNs) inspired by biological visual mechanisms [66].

2) Graph Neural Networks (GNNs): Leveraging relationships in image data

GNNs offer a unique approach to image classification by representing images as graphs and exploiting the relationships between pixels or image regions. This allows GNNs to capture contextual information and learn more robust representations compared to traditional CNNs.

a) GNN variants and their advantages: Two prominent Graph Neural Network (GNN) variants demonstrate considerable potential in image classification: Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs).


TABLE VI. CNN COMPONENTS AND THEIR ROLES IN MIC

Convolutional Layer
 Function: Applies filters to extract local features (edges, textures).
 Role in MIC: Hierarchical feature extraction, capturing spatial relationships.

Activation Function (e.g., ReLU)
 Function: Introduces non-linearity for learning complex patterns.
 Role in MIC: Enables complex decision boundaries for accurate classification.

Pooling Layer (e.g., Max Pooling)
 Function: Down-samples feature maps to reduce dimensionality and improve invariance.
 Role in MIC: Improves robustness to image variations and reduces computational cost.

Fully-Connected Layer
 Function: Integrates local features into global patterns for image understanding.
 Role in MIC: Combines learned features for final class prediction.

Softmax Layer
 Function: Converts outputs into a probability distribution over predicted classes.
 Role in MIC: Provides class probabilities for determining the most likely class.

TABLE VII. POPULAR CNN ARCHITECTURES AND THEIR ADVANCEMENTS

VGGNet (2014, [11])
 Advancement: Achieved SOTA performance with increased depth.
 Key technique: Small 3x3 filters for deeper networks.

GoogleNet (2014, [12])
 Advancement: Further reduced error rates with an efficient architecture.
 Key technique: Inception modules, 1x1 convolutions, global average pooling.

ResNet (2015, [13])
 Advancement: Enabled training of very deep networks.
 Key technique: Residual blocks with skip connections to address vanishing gradients.

EfficientNet (2019, [14])
 Advancement: SOTA accuracy with fewer parameters.
 Key technique: Compound scaling for optimal efficiency and performance.
• GCNs [15], by generalizing the convolution operation to graph data, effectively capture the local graph structure and relationships between nodes. This capability allows GCNs to leverage the inherent structural information within images for improved classification.

• GATs [16], on the other hand, introduce an attention mechanism to GNNs. This mechanism enables GATs to focus on relevant features within the graph, leading to improved feature extraction and, ultimately, enhanced prediction accuracy. By selectively attending to important features, GATs can make more informed decisions during image classification.

b) Benefits of GNNs for image classification

• Modeling complex relationships: GNNs excel at capturing intricate dependencies between image elements, leading to a better understanding of image context.

• Improved feature extraction: By considering relationships between nodes, GNNs can extract more informative and discriminative features for classification.

• Enhanced robustness: GNNs are less susceptible to noise and variations in image data due to their focus on structural information.

c) Summary: GNNs offer a valuable complementary approach to CNNs for image classification, particularly when dealing with data where relationships between elements are crucial. Their ability to leverage graph structures and learn contextual representations opens new avenues for improving accuracy and robustness in image classification tasks.
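A single GCN propagation step is compact enough to sketch directly: each node (here, a hypothetical image patch or superpixel) averages its neighbors' features through a normalized adjacency matrix before a learned linear map, which is how local graph structure enters the representation. This is a generic illustration of the GCN idea in [15]; graph size and feature dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # Add self-loops so each node keeps its own features.
        a = adj + torch.eye(adj.size(0), device=adj.device)
        # Symmetric degree normalization D^-1/2 A D^-1/2.
        d_inv_sqrt = a.sum(dim=1).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)
        # Aggregate neighbor features, then apply the learned transform.
        return torch.relu(self.linear(a_norm @ h))

# Toy graph: 5 nodes (e.g., image regions) with 8-dimensional features.
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()      # make the graph undirected
feats = torch.randn(5, 8)
layer = GCNLayer(8, 16)
print(layer(feats, adj).shape)           # -> torch.Size([5, 16])
```

A GAT layer would replace the fixed normalization with learned, data-dependent attention coefficients over each node's neighbors, which is exactly the selective-attention behavior described above.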


3) Transformers: Expanding horizons in image classification

Transformers, initially designed for NLP, have emerged as powerful contenders in image classification. Unlike CNNs, transformers leverage self-attention mechanisms to capture global context and long-range dependencies within images, leading to richer feature representations.

a) Contributions of transformers to image classification

• Feature extraction: Vision transformers (ViTs [17]) split images into patches, embed them into vectors, and incorporate positional information. Self-attention mechanisms then assess the importance of each patch in relation to others, enabling the capture of global context and intricate features.

• Class prediction: A classification head on top of the final transformer encoder layer predicts the image class based on the learned global context. Parallel processing of patches enhances computational efficiency compared to sequential CNNs.

b) Evolution of transformer architectures

• Vision Transformer (ViT [17]): Introduced the transformer architecture to image classification, achieving impressive performance with patch-based processing and self-attention.

• Data-efficient image Transformers (DeiT [18]): Improved efficiency through knowledge distillation and efficient training strategies, achieving comparable results with fewer resources.

• Specialized variants (e.g., TransUnet [19], TransUnet+ [20], and TransUnet++ [21]): Combine transformers with U-Net architectures for enhanced feature extraction and accurate segmentation in medical imaging tasks.

c) Addressing challenges: Techniques like dropout, regularization, and efficient optimization algorithms mitigate overfitting and manage computational complexity in transformers.
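The patch-based pipeline described above can be compressed into a short sketch: split the image into patches, linearly embed them, add positional embeddings, and let self-attention weigh every patch against every other. This is a deliberately simplified illustration (one attention block, mean pooling instead of a class token, arbitrary sizes), not the exact ViT recipe.

```python
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    """Minimal ViT-style front end: patchify -> embed -> add positions
    -> one block of multi-head self-attention over all patches."""

    def __init__(self, img=64, patch=16, dim=128, heads=4):
        super().__init__()
        n_patches = (img // patch) ** 2
        # A strided convolution is a standard way to patchify and embed.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)  # classification head (2 classes)

    def forward(self, x):                                   # x: (batch, 1, 64, 64)
        p = self.patch_embed(x).flatten(2).transpose(1, 2)  # (batch, 16, dim)
        p = p + self.pos_embed                              # inject patch positions
        ctx, _ = self.attn(p, p, p)                         # each patch attends to all
        return self.head(ctx.mean(dim=1))                   # pool patches -> logits

print(TinyViTEncoder()(torch.randn(2, 1, 64, 64)).shape)   # torch.Size([2, 2])
```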
In summation, the choice of architecture depends on the specific task and dataset characteristics. CNNs excel at local feature extraction, GNNs leverage relationships within data, and transformers capture global context and long-range dependencies. Understanding these strengths and weaknesses empowers researchers to select the most appropriate architecture for their MIC tasks.

C. Explainable Artificial Intelligence (XAI) in MIC

XAI techniques are crucial for fostering trust and understanding in MIC systems. Despite achieving human-level accuracy, the integration of automated MIC into clinical practice has been limited due to the lack of explanations for algorithmic decisions. XAI methodologies provide insights into the rationale behind the classification results of DL models, such as CNNs and Transformers, used in MIC tasks. By addressing the 'how' and 'why' behind predictive outcomes, XAI enhances the transparency and interpretability of MIC systems, contributing to their improved performance and acceptance in clinical settings.

1) XAI methods in CNNs and transformers: The field of XAI has witnessed significant advancements, particularly in the domain of MIC. This progress is evident in the evolution of XAI methods, transitioning from those primarily designed for CNNs to novel techniques tailored for Transformer architectures. The following tables provide a comparative analysis of recent advancements in XAI methods applied to CNNs (TABLE VIII) and Transformers (TABLE IX) within the MIC domain, along with techniques used to enhance system performance (TABLE X).
TABLE VIII. XAI METHODS FOR CNNS IN MIC

LIME [22]
 Principle: Approximates complex models with simpler interpretable ones (e.g., linear regression).
 How/why explanation: Explains "how" by analyzing feature perturbation impact.
 Methodological approach: Fits a simpler interpretable model to perturbed samples around an instance.
 System performance impact: Enhances local interpretability but may not capture global model behavior.

SHAP [23]
 Principle: Assigns feature contributions based on game theory principles.
 How/why explanation: Explains "how" and "why" by quantifying feature importance and interactions.
 Methodological approach: Computes average feature contribution across all possible feature subsets.
 System performance impact: Provides global and local explanations, valuable for understanding complex models.

CAM [24]
 Principle: Visualizes image regions contributing most to a specific class.
 How/why explanation: Explains "why" by highlighting relevant regions.
 Methodological approach: Combines feature maps and gradients to create a saliency map.
 System performance impact: Helps localize important features but lacks fine-grained details.

Grad-CAM [25]
 Principle: Improves CAM by incorporating gradient weights.
 How/why explanation: Explains "why" through visualization and "how" through contribution values on saliency maps.
 Methodological approach: Computes gradients of the class score with respect to feature maps for saliency map creation.
 System performance impact: Offers better localization and is widely used.

Grad-CAM++ [26]
 Principle: Refines Grad-CAM by addressing negative values and weight stability.
 How/why explanation: Explains "why" by enhancing visualization quality and "how" through a weighted combination of positive and negative partial derivatives.
 Methodological approach: Introduces Shapley values to estimate pixel contributions.
 System performance impact: Provides improved visual explanations and robustness.
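To ground the Grad-CAM row above, here is a minimal sketch of the method's core computation: the gradient of the class score with respect to the last convolutional feature maps is globally averaged to weight those maps, and the ReLU of the weighted sum is the saliency map. The torchvision backbone and layer choice are illustrative assumptions, not part of any surveyed system.

```python
import torch
from torchvision.models import resnet18

def grad_cam(model, target_layer, image, target_class):
    """Minimal Grad-CAM [25]: weight the target layer's feature maps by the
    spatially averaged gradient of the class score, then apply ReLU."""
    feats = []
    handle = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    score = model(image)[0, target_class]        # forward pass; pick class score
    handle.remove()

    fmap = feats[0]                              # (1, C, H', W') feature maps
    grads = torch.autograd.grad(score, fmap)[0]  # d(score) / d(feature maps)
    weights = grads.mean(dim=(2, 3), keepdim=True)  # global-average the gradients
    cam = torch.relu((weights * fmap).sum(dim=1))   # weighted sum + ReLU
    return cam / (cam.max() + 1e-8)              # normalized (1, H', W') saliency map

# Example with a generic torchvision backbone (an illustrative choice).
net = resnet18(num_classes=2).eval()
x = torch.randn(1, 3, 224, 224)
print(grad_cam(net, net.layer4, x, target_class=1).shape)  # torch.Size([1, 7, 7])
```

Grad-CAM++ refines the `weights` step with higher-order terms; the overall weighting-then-ReLU structure stays the same.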

TABLE IX. XAI METHODS FOR TRANSFORMERS IN MIC

ProtoPFormer [27]
 Principle: Interpretable image recognition using global and local prototypes.
 How/why explanation: Explains "how" by utilizing prototypes to capture target features and "why" by addressing the need for improved interpretability in ViTs.
 Methodological approach: Employs a prototype-based XAI technique to enhance ViT interpretability.
 System performance impact: Achieves superior performance and visualization results compared to SOTA baselines.

X-Pruner [28]
 Principle: Explainable pruning framework for ViTs.
 How/why explanation: Explains "how" by measuring each unit's contribution to class prediction using an explainability-aware mask.
 Methodological approach: Adaptively searches a layer-wise threshold based on explainability-aware mask values.
 System performance impact: Outperforms SOTA black-box pruning methods with reduced computational costs and slight performance degradation.

Grad-CAM for ViTs [29]
 Principle: Visualization of ViT decision-making.
 How/why explanation: Explains "why" by revealing focus areas during ViT decision-making.
 Methodological approach: Generates class activation maps for ViTs.
 System performance impact: Can enhance ViT model fine-tuning but requires further improvement.

TABLE X. XAI TECHNIQUES FOR ENHANCING SYSTEM PERFORMANCE

Explainable Pruning
 Description: Pruning techniques like X-Pruner that utilize XAI to guide the removal of less important model components.
 Impact on system performance: Reduces computational cost and model complexity while maintaining or improving performance.

Attention Visualization
 Description: Visualizing attention mechanisms in Transformers to understand which parts of the input the model focuses on.
 Impact on system performance: Provides insights for model improvement and debugging.

Feature Importance Analysis
 Description: Techniques like SHAP that quantify the importance of individual features for model predictions.
 Impact on system performance: Helps identify key features and potential biases, leading to improved model design and feature engineering.

Adversarial Training
 Description: Training models with adversarial examples to improve robustness and generalizability. XAI methods can be used to analyze the impact of adversarial attacks and guide the development of defense strategies.
 Impact on system performance: Enhances model robustness and performance against adversarial attacks.


2) Discussion: The tables above illustrate the diverse range of XAI methods available for both CNNs and Transformers in MIC. While CNN-based methods like LIME, SHAP, and Grad-CAM variants have been widely explored, the emergence of Transformers has led to the development of novel techniques like ProtoPFormer and X-Pruner. These methods offer unique advantages in terms of interpretability and performance improvement.

a) Key observations

• Focus on Visual Explanations: Many XAI methods, particularly those applied to CNNs, emphasize visual explanations through saliency maps and other visualization techniques. This is crucial in MIC, where understanding the model's focus on specific image regions is essential for building trust and ensuring reliable diagnoses.

• Evolution from Local to Global Explanations: XAI methods have progressed from providing local explanations for individual predictions (e.g., LIME) to offering global interpretations of model behavior (e.g., SHAP). This allows for a more comprehensive understanding of the decision-making process.

• Integration with Model Optimization: Techniques like X-Pruner demonstrate the potential of integrating XAI with model optimization strategies like pruning. This allows for the development of more efficient and interpretable models.

b) Future directions

• Developing XAI methods specifically tailored for Transformer architectures: While existing techniques like Grad-CAM have been adapted for ViTs, further research is needed to explore methods that fully leverage the unique characteristics of Transformers.

• Combining XAI with other AI advancements: Integrating XAI with areas like federated learning and continual learning can lead to more robust and adaptable medical image classification systems.

• Standardization and Benchmarking: Establishing standardized evaluation metrics and benchmarks for XAI methods will facilitate fair comparisons and accelerate progress in the field.

c) Enhancing performance and accuracy in MIC with XAI: XAI techniques significantly improve the performance and accuracy of MIC systems by providing transparency and facilitating error detection and correction. These techniques help identify and rectify model shortcomings, leading to more reliable and effective MIC systems.

CNN-based XAI Techniques:

• LIME creates interpretable models for individual predictions, helping to identify and correct misclassifications by highlighting important features.

• SHAP provides a unified measure of feature importance, allowing for precise identification of influential features and potential sources of errors.

• CAM-based Methods: These methods generate visual explanations by highlighting regions in the input image that influence the model's predictions, making it easier to spot and address inaccuracies.

Transformer-based XAI Techniques:

• ProtoPFormer uses prototypical parts to explain predictions, aiding in the identification of errors by comparing new instances with learned prototypes.

• X-Pruner prunes less important parts of the model, enhancing interpretability and helping to pinpoint and fix model weaknesses.

• Grad-CAM for Vision Transformers adapts Grad-CAM to transformers, providing visual explanations that help in diagnosing and correcting errors in transformer-based MIC models.

Impact on MIC:

• Error Detection: XAI techniques make it easier to identify misclassifications and understand why they occur, enabling targeted corrections.

• Model Improvement: By revealing which features and regions are most influential, XAI helps refine model training and architecture, leading to better performance.

• Trust and Reliability: Enhanced transparency builds trust among clinicians, ensuring that MIC systems are more likely to be adopted and relied upon in clinical settings.

Some recent XAI techniques:

Recent studies have shown that using XAI methods such as Integrated Gradients can significantly enhance the performance of classification systems.

A notable study by Apicella et al. (2023, [67]) investigated the application of Integrated Gradients, a technique from XAI, to enhance the performance of classification models. The study focused on three distinct datasets: Fashion-MNIST, CIFAR10, and STL10. Integrated Gradients were employed to identify and quantify the importance of input features contributing to the model's predictions. By analyzing these feature attributions, the researchers were able to pinpoint which features had the most significant impact on the model's output. The insights gained from Integrated Gradients were then used to refine the model. This involved adjusting the model parameters and structure to better capture the critical features identified by the XAI method. The study demonstrated that, through this process of feature importance analysis and subsequent model optimization, classification performance improved significantly across all tested datasets. This approach not only enhanced accuracy but also provided a deeper understanding of the model's decision-making process.
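A minimal sketch of the Integrated Gradients attribution used in such studies is shown below: average the gradient of the class score along a straight path from a baseline to the input, then scale by the input-baseline difference. The final lines approximate the attribution-based re-weighting idea elaborated in the next study; exact details follow the cited papers, not this sketch, and the toy linear "model" is a placeholder.

```python
import torch

def integrated_gradients(model, x, target_class, baseline=None, steps=50):
    """Integrated Gradients: average the gradient of the class score along
    a straight path from a baseline (default: all zeros) to the input x,
    then scale by (x - baseline)."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    grads = torch.zeros_like(x)
    for k in range(1, steps + 1):
        # Interpolated point on the baseline-to-input path.
        point = baseline + (k / steps) * (x - baseline)
        point.requires_grad_(True)
        score = model(point)[0, target_class]
        grad, = torch.autograd.grad(score, point)
        grads += grad
    attribution = (x - baseline) * grads / steps

    # Soft mask in the spirit of the masking scheme discussed next:
    # emphasize inputs with high absolute attribution.
    mask = attribution.abs()
    mask = mask / (mask.max() + 1e-8)
    return attribution, x * mask  # attributions and a softly masked input

# Toy usage: a linear layer stands in for a trained classifier.
toy = torch.nn.Linear(10, 3)
x = torch.randn(1, 10)
attr, masked = integrated_gradients(lambda t: toy(t), x, target_class=0)
print(attr.shape, masked.shape)
```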


Additionally, another study by Apicella et al. (2023, [68]) introduced an innovative method that also leveraged Integrated Gradients to boost classification system performance. This study proposed a soft masking scheme, wherein the explanations generated by Integrated Gradients were used to create masks that highlight important features while downplaying less relevant ones. The soft masking approach involved applying these masks during the training phase of the machine learning model. By focusing the model's attention on the most influential features as determined by Integrated Gradients, the training process became more efficient and effective. The experimental results from this study showed a marked improvement in model accuracy across the same datasets: Fashion-MNIST, CIFAR10, and STL10. The use of soft masks helped in reducing noise and enhancing the signal of critical features, thereby leading to better generalization and performance of the classification system.

All in all, XAI explanations enhance both model understanding and classification performance.

Summary of Level 1 Findings:

• Learning models have evolved from traditional supervised learning to advanced techniques like Med-VLMs, few-shot, and zero-shot learning, addressing data scarcity in medical imaging.

• Network architectures have progressed from CNNs to Transformers, with hybrid models showing promise in capturing both local and global features.

• XAI methods have become crucial for enhancing model interpretability and trust in clinical settings, with techniques like Grad-CAM and SHAP leading the way.

• The integration of these advancements has led to more robust, efficient, and interpretable models for medical image classification.

IV. LEVEL 2 OF MIC (TASK-SPECIFIC MODELS)

Expanding on initial network architectures, the second level focuses on specialized architectures for MIC. It takes a comprehensive approach, combining classification with segmentation through multitask learning models. This broad view deepens understanding of MIC network architectures, paving the way for specific applications.

A. Recent Advances in Level 2 for Single Task

1) Specific DNN architectures for single task (classification): This review assesses recent advancements in DNN architectures for single-task classification in medical image analysis. It evaluates specialized architectures across CNNs, GNNs, and Transformers, considering methodology, datasets, effectiveness, advantages, and limitations. The comparative analysis, summarized in TABLE XI, highlights key developments and their implications for MIC, offering a comprehensive overview of the current state-of-the-art in the field.

Key insights:

a) Adaptability and efficiency: Unet and Unet++ demonstrate adaptability to new tasks and improved segmentation accuracy, though at the cost of increased parameters.

b) Innovative approaches: Snapshot Ensemble and GazeGNN introduce novel methods like GradCAM and eye-gaze data utilization, showcasing the potential of combining different data types and analytical techniques.

c) Challenges in complexity and data requirements: While architectures like PTRN and CCF-GNN show promise in specific tasks, they highlight the ongoing challenges of computational demands and the need for extensive training data.

d) Future directions: The evolution from CNN-based architectures to incorporating GNN and Transformer models indicates a shift towards more complex, yet potentially more effective methods for medical image classification. However, issues such as interpretability, computational efficiency, and data availability remain critical areas for future research.

This summary underscores the dynamic nature of deep learning research in medical image classification, emphasizing the need for continued innovation and exploration of new methodologies.

2) Med-VLMs for MIC: The recent rise of Med-VLMs has greatly influenced MIC. These models utilize NLP and CV to analyze medical images and text reports, enhancing diagnostic accuracy and efficiency. TABLE XII summarizes key Med-VLMs in MIC, highlighting their performance in zero-shot and few-shot learning scenarios.

Advancements and Impact

Med-VLMs demonstrate remarkable progress in MIC, particularly in scenarios with limited labeled data.

a) Zero-shot learning: Models like KAD showcase the ability to classify images of unseen pathologies without explicit training, highlighting the potential for real-world clinical applications.

b) Few-shot learning: CLIPath and ConVIRT achieve SOTA performance with minimal labeled data, reducing the burden of data annotation in clinical settings.

Future Directions

The field of Med-VLMs is rapidly evolving, with ongoing research exploring:

a) Multi-modal learning: Integrating diverse data modalities (e.g., images, text, genomics) for a more comprehensive understanding of diseases.

b) Explainability and interpretability: Enhancing transparency and trust in model predictions.

c) Domain adaptation: Adapting models to diverse clinical settings and populations.


Summary

Med-VLMs revolutionize MIC, promising improved diagnosis, treatment planning, and patient care. With ongoing research, they have the potential to transform healthcare, enabling more accurate, efficient, and personalized medicine.

B. Recent Advances in Level 2 for Multitask (Classification and Segmentation)

Multitask learning (MTL) is vital in MIC tasks, overcoming individual model limitations and boosting overall performance. A recent comprehensive study highlighted the substantial progress made in medical image segmentation using DNNs, leading to more accurate and efficient diagnostic processes [69]. By optimizing image segmentation and classification together, MTL provides numerous advantages:

a) Mitigating data scarcity: MTL leverages knowledge transfer across related tasks, enabling models to learn from complementary data sources and improve performance on the target task, even with limited data availability.

b) Optimizing resource utilization: By sharing feature representations across tasks, MTL optimizes the use of computational resources, leading to more efficient model architectures and reduced computational overhead.

c) Learning robust shared representations: MTL encourages the learning of shared features that are beneficial for both segmentation and classification tasks. These shared representations capture task-agnostic information, leading to improved generalization and performance across multiple MIC tasks.

MTL in MIC tasks effectively tackles challenges like data scarcity, resource constraints, and the necessity for robust, generalizable models. By leveraging synergies between related tasks, MTL enhances MIC systems' performance and efficiency, leading to better clinical decision-making and patient outcomes (the shared-encoder pattern behind these advantages is sketched in code below).

3) Typical architectures for multitasking in MIC: Researchers have investigated different MTL configurations like feature extraction, fine-tuning, and hybrids to match diverse medical imaging contexts and data availability. TABLE XIII surveys the latest notable DNN architectures using multitasking to boost MIC performance.

In summary, these MTL-based architectures demonstrate significant advancements in addressing data scarcity, improving resource efficiency, and leveraging shared representations to enhance medical image classification performance across various modalities and disease domains.
representations across tasks, MTL optimizes the use of resource efficiency, and leveraging shared representations to
computational resources, leading to more efficient model enhance medical image classification performance across
architectures and reduced computational overhead. various modalities and disease domains.

TABLE XI. SUMMARY OF KEY ARCHITECTURES FOR SINGLE TASK (MIC)

Unet (2015, [30])
 Method: Supervised learning with encoder-decoder architecture.
 Data: PhC-U373: 30 images; DIC-HeLa: 35 images.
 Effectiveness: High IoU scores (92% for PhC-U373, 77.5% for DIC-HeLa).
 Advantages: Accurate segmentation with limited data; adaptable to new tasks.
 Limitations: Limited in extracting long-range information; lacks explanation for predictions.

Unet++ (2018, [31])
 Method: Supervised learning with redesigned skip paths.
 Data: Cell nuclei, colon polyp, liver, and lung nodule images.
 Effectiveness: Cell nuclei: 92.63; colon polyp: 33.45; liver: 82.90; lung nodule: 77.21; improved IoU over Unet.
 Advantages: Reduces the semantic gap; improves accuracy and speed.
 Limitations: Increases parameter count; lacks explanation for predictions.

Snapshot Ensemble (2021, [32])
 Method: Supervised learning with EfficientNet-B0 and GradCAM.
 Data: Malaria Dataset: 27,558 erythrocyte images with equal cases of parasitized and uninfected cells. Source: CMC hospital in Bangladesh.
 Effectiveness: High F1 score (99.37%) and AUC (99.57%).
 Advantages: Timely and accurate malaria diagnosis; uses GradCAM for explanations.
 Limitations: Focused only on P. falciparum, not other species.

PTRN (2022, [33])
 Method: Supervised learning with DenseNet-201.
 Data: CheXpert: 224,316 digital CXRs; CheXphoto: 10,507 CXRs.
 Effectiveness: CheXpert: 0.896; CheXphoto-Monitor: 0.880; CheXphoto-Film: 0.802; mean AUC: 0.850.
 Advantages: Reduces the cost of collecting natural data; eliminates negative impacts of projective transformation.
 Limitations: Higher computation costs; untuned hyperparameters.

CCF-GNN (2023, [34])
 Method: Supervised learning with GNN.
 Data: TCGA-GBMLGG, BRACS, Bladder Cancer, and ExtCRC images.
 Effectiveness: High AUC (0.912 for TCGA-GBMLGG) and accuracy.
 Advantages: Effectively analyzes pathology images; represents cancer-relevant cell communities.
 Limitations: Requires extensive training data; longer processing time.

GazeGNN (2023, [35])
 Method: Supervised learning with GNN.
 Data: Chest X-ray: 1,083 images.
 Effectiveness: High accuracy (0.832) and AUC (0.923).
 Advantages: Captures complex relationships via graph learning without pre-generated VAMs.
 Limitations: Needs eye-tracking devices for gaze data collection.

SEViT (2022, [36])
 Method: Supervised learning with Transformer.
 Data: Chest X-ray: 7,000 images (Normal or Tuberculosis); Fundoscopy (APTOS2019): 3,662 diabetic retinopathy (DR) retina images (5 classes).
 Effectiveness: High accuracy (94.64% for Chest X-ray) and AUC.
 Advantages: Detects adversarial samples by assessing prediction consistency.
 Limitations: Full white-box settings not evaluated in natural image contexts.

MedViT (2023, [37])
 Method: Supervised learning with hybrid CNN-Transformer.
 Data: MedMNIST-2D: 12 biomedical datasets (CT, X-ray, ultrasound, and OCT images).
 Effectiveness: Average accuracy of 0.851 and AUC of 0.942.
 Advantages: Reduces computational complexity; high generalization ability.
 Limitations: Lacks precise hyperparameter tuning; employs two CNNs.


TABLE XII. PERFORMANCE OF MED-VLMS IN MIC

| Model | Modality | Zero-shot Learning | Few-shot Learning | Encoders and fusion method | Pre-trained objective | Key Features |
|---|---|---|---|---|---|---|
| BERTHop [38] | Chest X-ray | AUC: 98.12% | None | Language encoder: BlueBERT; Vision encoder: PixelHop++; Fusion: early fusion | Hybrid: matching and masking (masked language modelling) | Combines PixelHop++ and BlueBERT for effective visual-language fusion |
| KAD [39] | Chest X-ray | Outperforms expert radiologists on multiple pathologies | Excels with few-shot annotations | Language encoder: PubMedBERT; Vision encoder: ResNet-50, ViT-16; Fusion: late fusion | Cross-modal global contrastive learning, hybrid with an additional classification objective | Leverages medical knowledge graphs for improved zero-shot performance and auto-diagnosis |
| CLIPath [40] | Pathology | Strong transferability | Efficient adaptation with limited data | Language encoder: BERT; Vision encoder: ResNet-50 or ViT; Fusion: early fusion | Contrastive learning | Fine-tunes CLIP using Residual Feature Connection for pathology image classification |
| ConVIRT [41] | Chest X-ray | Competitive performance | SOTA with few-shot annotations | Language encoder: BERT; Vision encoder: ResNet50; Fusion: no fusion | Global contrastive learning | SOTA with few-shot annotations |
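Several models in Table XII are pre-trained with a global contrastive objective that pulls matched image-report pairs together. A minimal sketch of the symmetric InfoNCE loss commonly used for this (PyTorch assumed; the embeddings below are random stand-ins for real encoder outputs):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over an in-batch similarity matrix.
    Matched image-report pairs (the diagonal) are pulled together,
    mismatched pairs are pushed apart."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # (N, N) cosine similarities
    targets = torch.arange(len(logits))    # i-th image matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Stand-ins for vision/language encoder outputs (batch of 8, dim 128).
loss = clip_style_loss(torch.randn(8, 128), torch.randn(8, 128))
```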

TABLE XIII. SUMMARY OF KEY ARCHITECTURES FOR MULTITASK LEARNING (MIC)

| Works | Method | Data | Effectiveness | Advantages | Limitations |
|---|---|---|---|---|---|
| Mask-RCNN-X101 (2021, [42]) | Supervised learning, Mask-RCNN-X101 architecture | 934 radiographs (667 benign, 267 malignant bone tumors) | Bone tumor classification: 80.2% accuracy, 62.9% sensitivity, 88.2% specificity; bounding box placement: IoU 0.52; segmentation: mean Dice score 0.60 | Assists the diagnostic workflow by accurately placing bounding boxes, segmenting, and classifying primary bone tumors | Selection bias; inability to predict other diseases; fixed image resolution; lack of bone metastases and density information |
| Cerberus (2023, [43]) | Supervised learning, shared encoder (ResNet34) and independent decoders (U-Net) | Gland: 1,602 GlaS + 3,209 CRAG + 46,346 generated; Lumen: 56,358; Nuclei: 495,179 | Segmentation: Nuclei 0.774, 0.560; Gland 0.908, 0.640; Lumen 0.666, 0.525; Classification: mAP 0.948, mF1 0.883 | Simultaneously predicts multiple tasks without compromising performance; publishes processed TCGA dataset | Performance enhancement on new tasks yet to be explored |
| MNC-Net (2023, [44]) | Supervised learning, graph encoder and cluster-layer | Parkinson's Progression Markers Initiative (PPMI) MRI data | ACC 95.50%, F1 95.49%, Prec 97.00%, Rec 94.42% | Early diagnosis of Parkinson's disease using clinical scores and brain regions; manages brain network complexity effectively | Limited to node-level tasks; does not capture all Parkinson's-related information |
| AMTI-GCN (2024, [45]) | Supervised learning, feature sharing, and task-specific modules | AD-NC, AD-MCI, NC-MCI, MCIn-MCIp (186-393 samples) | NC-MCI: ACC 70.1, SEN 69.3, SPE 70.8, AUC 70.6, ADAS-Cog CC 0.477, MMSE CC 0.498; MCIn-MCIp: ACC 71.9, SEN 73.2, SPE 71.1, AUC 72.5, ADAS-Cog CC 0.485, MMSE CC 0.522 | Addresses limitations in interpretation, binary Alzheimer's diagnosis, and ignored task correlation | Did not explore potential correlations between ADAS-Cog, MMSE, and other factors like education level |
| TransMT-Net (2023, [46]) | Active learning, hybrid CNN-Transformer architecture | Polyp: 1,645 images | Seg.: DSC 77.76%, IoU 67.40%, 95% HD 21.62 mm; Class.: Acc 96.94%, Prec 96.56%, Rec 96.52%, F1 96.54% | Effectively addresses lesion classification and segmentation in GI tract endoscopic images | Slightly higher computational complexity; inferior segmentation performance with 70% training set; varied processing speed |
| CNN-VisT-MLP-Mixer (2024, [47]) | Supervised learning, hybrid CNN-ViT architecture and MLP-Mixer | BUSI: 789; UDIAT: 163 | Seg.: BUSI (Acc 94.04, DC 83.42, IoU 72.56, Sen 80.10), UDIAT (Acc 97.88, DC 81.52, IoU 70.32, Sen 90.32); Class.: Acc 86.00, Prec 86.11, Rec 86.02, F1 85.93, Sen 89.42, Spec 85.26 | Effectively captures local and high-level features in breast ultrasound images; enhances feature integration | Inability to monitor the tumor's surrounding environment during diagnosis |
4) Med-VLMs for multitask (classification and segmentation): Recent advancements in Med-VLMs have significantly improved the accuracy and efficiency of MIC by leveraging the power of multimodal AI. These models excel at handling multitask challenges, such as simultaneous classification and segmentation, leading to a more comprehensive understanding of medical images. Table XIV summarizes key Med-VLMs and their contributions to MIC.
These Med-VLMs demonstrate several key advancements in MIC:
a) Enhanced medical knowledge: Models like MedKLIP incorporate medical knowledge bases and text extraction techniques to improve understanding of medical images.
b) Improved representation learning: Techniques like attention mechanisms and contrastive learning enable models like GLoRIA and CONCH to learn more robust and efficient representations of medical images.


c) Anatomical structure guidance: ASG (IRA) and MeDSLIP leverage anatomical information to improve interpretability and clinical relevance, leading to more accurate classifications.
d) Multitask capabilities: Many of these models excel at both classification and segmentation tasks, providing a more comprehensive analysis of medical images.
e) Zero-shot and few-shot learning: Several models, including GLoRIA and SAT, demonstrate strong performance even with limited labeled data, making them valuable in scenarios with scarce data resources.
Significantly, Med-VLMs are revolutionizing MIC by leveraging the power of multimodal AI and multitask learning. These models offer enhanced diagnostic precision, efficiency, and interpretability, ultimately leading to improved patient care and outcomes. As research in this area continues, we can expect even more powerful and versatile Med-VLMs to emerge, further transforming the field of medical imaging and healthcare as a whole.
TABLE XIV. COMPARISON OF MED-VLMS FOR MULTITASK MEDICAL IMAGE ANALYSIS

| Model | Encoders and fusion method | Pre-trained objective | Key innovations | Strengths |
|---|---|---|---|---|
| GLoRIA [48] | Language encoder: BioClinicalBERT; Vision encoder: ResNet-50; Fusion: late fusion | Global and local contrastive learning | Multimodal global-local approach, attention-weighted image regions | Data efficiency, zero-shot capabilities, excels in limited-label settings |
| ASG (IRA) [49] | Language encoder: BioClinicalBERT; Vision encoder: ResNet-50 and ViT-B/16; Fusion: late fusion | Contrastive learning and image tag recognition | Anatomical structure guidance, image-report alignment | Improved interpretability and clinical relevance, enhanced representation learning |
| MeDSLIP [50] | Language encoder: BioClinicalBERT; Vision encoder: ResNet-50; Fusion: late fusion | Hybrid: prototypical contrastive learning and intra-image contrastive learning | Dual-stream architecture for disentangling anatomical and pathological information | Precise vision-language associations, improved performance in medical image captioning and report generation |
| SAT [51] | Language encoder: BioClinicalBERT; Vision encoder: ResNet-50; Fusion: late fusion | Contrastive learning | Semantic-aware transformer integrating semantic information | Effective representation learning, excels in data/no-data recognition tasks |
| CONCH [52] | Language encoder: GPT-style Transformer; Vision encoder: ViT-Base; Fusion: early fusion | Hybrid: contrastive learning and captioning objective | Contrastive learning from captions for histopathology images | SOTA performance in histology image classification, segmentation, and retrieval tasks |
| ECAMP [53] | Language encoder: BERT; Vision encoder: ViT-B/16; Fusion: early fusion (multi-scale context fusion) | Hybrid: masked image modeling, masked language modeling, and context-guided super-resolution | Entity-centered context-aware pre-training, multi-scale context fusion | Enhanced text-image interplay, improved performance in downstream medical imaging tasks |
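For intuition, the zero-shot capability noted in item e) above and in Table XIV typically works by comparing an image embedding against embeddings of text prompts built from class names. A toy sketch under those assumptions (the encoders are random stand-ins for pre-trained vision/language towers, and the prompt template is hypothetical):

```python
import torch
import torch.nn.functional as F

# Stand-ins for pre-trained vision/language encoders (illustrative only).
def encode_image(img):  return F.normalize(torch.randn(1, 128), dim=-1)
def encode_text(texts): return F.normalize(torch.randn(len(texts), 128), dim=-1)

def zero_shot_classify(img, class_names):
    """Score an image against text prompts; no task-specific training."""
    prompts = [f"a chest X-ray showing {c}" for c in class_names]
    sims = encode_image(img) @ encode_text(prompts).t()  # cosine similarities
    probs = sims.softmax(dim=-1).squeeze(0)
    return dict(zip(class_names, probs.tolist()))

print(zero_shot_classify(torch.randn(1, 3, 224, 224),
                         ["pneumonia", "cardiomegaly", "no finding"]))
```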

V. RECENT ADVANCES IN LEVEL 3 OF MIC (SPECIFIC APPLICATIONS)
A. Medical Image Data
1) Medical imaging modalities: Medical imaging plays a critical role in modern healthcare, offering non-invasive visualization of the human body for diagnosis and treatment planning. Various modalities, including X-ray, CT, MRI, ultrasound, PET, and SPECT, provide unique insights into different organs and tissues (Fig. 3 [70]); Fig. 4 [71] illustrates the diverse applications of these modalities across various anatomical structures.
Fig. 3. Illustration of the diverse imaging techniques [70].
The Medical Segmentation Decathlon dataset [72] exemplifies this diversity, encompassing 2,633 3D images spanning ten different organs (Fig. 5). Each modality possesses distinct characteristics, advantages, and limitations, necessitating careful selection based on the specific clinical scenario. Understanding these nuances is crucial for optimal utilization of medical imaging technology. A concise comparison of imaging techniques, highlighting their unique advantages and limitations, is presented in Table XV.

Fig. 4. An overview of the organs and corresponding medical imaging modalities [71].


Fig. 5. An illustration of the Medical Segmentation Decathlon's ten distinct tasks [72].

2) Public databases in medical imaging research: The growth of public medical image databases has been crucial in advancing disease classification research. Noteworthy databases include:
• ChestX-ray14: Over 100,000 chest X-ray images.
• MURA: More than 40,000 X-ray images of bones and joints.
• NIH Clinical Center's dataset: A diverse range of modalities.
• ISIC: Skin image collection for lesion detection.
• DeepLesion: Nearly 10,600 CT scans.
• CheXpert: Over 224,000 chest radiographs.
• MIMIC-CXR: Over 377,000 chest radiographs.
Platforms like the World Health Data Hub of the WHO, Medical ImageNet, Kaggle, and PapersWithCode offer extensive resources for machine learning research in medical imaging, showcasing the collaborative and open nature of contemporary scientific inquiry.
3) Advanced techniques in medical imaging research: Innovative computational techniques such as augmentation, transfer learning, Generative Adversarial Networks (GANs), and Federated Learning are pushing the boundaries of medical imaging research. These methods improve model performance, generate new data, and enable decentralized learning, thus enhancing the robustness and diversity of medical imaging applications.
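As one concrete instance of the transfer-learning technique mentioned above, a common recipe reuses an ImageNet-pretrained backbone and retrains only a small task head on the medical dataset. A minimal sketch (PyTorch/torchvision assumed; the backbone choice, class count, and learning rate are illustrative):

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained backbone supplies generic visual features.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():      # freeze the backbone: with scarce medical
    p.requires_grad = False       # labels, only the new head is trained
model.fc = nn.Linear(model.fc.in_features, 3)  # hypothetical 3-class task
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```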
4) Summary: Medical imaging is a cornerstone of modern medical diagnostics, with each modality serving specific purposes based on clinical needs. The advent of AI and machine learning, alongside the proliferation of public datasets, is revolutionizing medical imaging research, promising more accurate disease detection and personalized medicine. The future of medical imaging lies in harnessing these technological advancements to improve healthcare outcomes.
General comments:
• Ionizing radiation (X-ray and CT) can be harmful, especially for pregnant women.
• MRI offers the highest detail without radiation but is expensive and not suitable for everyone.
• Ultrasound is safe and widely available but offers less detail.
• PET and SPECT provide functional information but involve radioactive materials.
B. Advancements in Medical Imaging Diagnosis: From CAD to AI-CAD
AI integration has profoundly transformed various domains, notably evident in medical diagnostic imaging. This shift marks a significant departure from conventional Computer-Aided Diagnosis (CAD) to AI-driven CAD systems, ushering in a new era of diagnostic capabilities.
The evolution began in the 1960s with CAD systems aiming to automate diagnostic processes. A significant milestone was the FDA's approval of a mammography CAD device by R2 Technology, Inc., in 1998, marking the start of the "CAD era." Endorsement for reimbursement by the Centers for Medicare and Medicaid Services in 2002 further accelerated CAD's adoption across modalities like chest radiographs and CT scans.
CAD systems encompass three categories based on their role in image interpretation: second-reader, concurrent-reader, and first-reader types (Fig. 6 [73]). Notably, interactive CAD falls under the first-reader type. The evolution of CAD architecture has transitioned from sequential interpretation (second-reader CAD, Fig. 6(a)) to simultaneous interpretation (concurrent-reader CAD, Fig. 6(b)), streamlining the diagnostic process by integrating CAD results from the outset. The advent of first-reader CAD (Fig. 6(c)) presents a novel approach in which CAD autonomously conducts the initial interpretation, guiding the physician's analysis solely on CAD-marked images, showing promise for mass screenings like mammography.

Dataset and platform links:
1) ChestX-ray14: https://nihcc.app.box.com/v/ChestXray-NIHCC
2) MURA: https://stanfordmlgroup.github.io/competitions/mura/
3) NIH Clinical Center: https://clinicalcenter.nih.gov/
4) ISIC: https://isdis.org/
5) DeepLesion: https://camelyon17.grand-challenge.org/
6) CheXpert: https://aimi.stanford.edu/chexpert-chest-x-rays
7) MIMIC-CXR: https://physionet.org/content/mimic-cxr/2.0.0/
8) World Health Data Hub: https://www.who.int/data/
9) Medical ImageNet: https://aimi.stanford.edu/medical-imagenet
10) Kaggle: https://www.kaggle.com/datasets
11) PapersWithCode: https://paperswithcode.com/datasets


TABLE XV. COMPARISON OF MEDICAL IMAGING MODALITIES IN MIC

| Technique | Description | Pros | Cons | Safety and Image Detail |
|---|---|---|---|---|
| X-ray | Examines bones; detects fractures, tumors, and infections. | Quick, painless, cost-effective, immediate results, widely available. | Limited soft tissue contrast; ionizing radiation exposure; not suitable for detailed organ visualization. | Radiation risk: moderate. Image detail: low. Best for bone visualization. |
| CT | Detailed cross-sectional images of the body; examines organs and blood vessels and detects abnormalities. | High-resolution images, fast acquisition time, useful for diagnosing trauma, differentiates tissue densities. | Ionizing radiation exposure; limited soft tissue contrast compared to MRI; not suitable for pregnant women due to radiation risks. | Radiation risk: high. Image detail: high. Excellent for visualizing organs and bone. |
| MRI | Detailed images of internal structures; assesses brain, spinal cord, joints, and organs. | Superior soft tissue contrast, no ionizing radiation, multiplanar imaging, detects subtle abnormalities. | Expensive, long scan times, contraindicated for patients with certain metallic implants. | Radiation risk: none. Image detail: very high. Best for soft tissue and organ visualization. |
| Ultrasound | Uses sound waves to produce real-time images; examines abdomen, pelvis, and heart and monitors fetal development. | Real-time imaging, non-invasive, safe, portable, widely available, no ionizing radiation. | Operator-dependent, limited penetration in obese patients, less detailed images compared to other modalities. | Radiation risk: none. Image detail: moderate. Best for real-time imaging and pregnancy monitoring. |
| PET | Visualizes metabolic processes; detects cancer, assesses treatment response, evaluates brain disorders. | Provides functional information, detects diseases early, helps in personalized medicine. | Expensive, limited spatial resolution, radioactive material involved. | Radiation risk: low. Image detail: moderate. Best for visualizing metabolic activity. |
| SPECT | Detects gamma rays emitted by a tracer; assesses blood flow, detects myocardial infarctions, evaluates brain disorders. | Non-invasive, provides functional information, good spatial resolution. | Longer acquisition time than PET, lower sensitivity than PET, radioactive material involved. | Radiation risk: low. Image detail: moderate. Best for visualizing blood flow and brain function. |

Fig. 6. Categorization of CAD systems in medical imaging interpretation: a) Second-reader, b) Concurrent-reader, and c) First-reader types [73].

Despite CAD's acknowledged utility, persistent challenges include high development costs, elevated false-positive rates leading to increased recalls and biopsies, and limited clinical efficacy. These challenges are well-documented in clinical studies, emphasizing the need for AI-driven solutions.
The recent introduction of AI-CAD, primarily employing deep learning methodologies, signifies a significant advancement. Deep learning algorithms have proven effective in reducing interpretation time and improving diagnostic accuracy, as demonstrated by studies like Kyono et al., which explored deep learning's potential to ease radiologists' workload in mammography screenings. AI-CAD's reliance on deep learning adopts a data-driven approach, benefiting from extensive datasets to enhance performance. Fig. 7 illustrates the superior performance of deep learning-based AI-CAD compared to traditional CAD systems, particularly with increasing data volume.

Fig. 7. Development processes: a) Conventional CAD vs. b) Deep learning-based AI-CAD [73].

In conclusion, the shift from CAD to AI-CAD represents a significant advancement in medical imaging diagnosis, offering increased accuracy, efficiency, and versatility. As AI matures, its integration has the potential to revolutionize healthcare delivery, providing clinicians with sophisticated diagnostic tools for precise and timely patient care.
Remarkable Applications of AI-CAD:
• Breast Cancer: AI-CAD systems have demonstrated significant potential in breast cancer screening and mammography interpretation. Systems like cmAssist [54] can reduce false-positive markings by up to 69%, minimizing unnecessary follow-up procedures and patient anxiety. Deep learning models have shown accuracy comparable to experienced radiologists, with some hybrid models outperforming human experts. The AI-STREAM study [55] aims to generate real-world evidence on the benefits and drawbacks of AI-based computer-aided detection/diagnosis (CADe/x) for breast cancer screening.


• Tuberculosis Detection: AI-based CAD systems can assist in community-based active case finding for tuberculosis, especially in areas with limited access to experienced physicians. Okada et al. [56] demonstrated the applicability of AI-CAD for pulmonary tuberculosis in community-based active case finding, showing performance levels nearing human experts. This approach holds promise in triaging and screening tuberculosis, with significant implications for addressing healthcare professional shortages in low- and middle-income countries. Such advancements contribute to the World Health Organization's goal of "Ending tuberculosis" by 2030.
• Eye Disease Diagnosis: Google's deep learning analysis [57] achieved a detection sensitivity of about 98% in diagnosing eye diseases. AI analysis of fundus photographs [58] can assist in diagnosing not only eye diseases but also systemic conditions like heart disease, surpassing human capabilities.
• Skin Cancer Diagnosis: AI demonstrates accuracy equivalent to or higher than dermatologists in diagnosing skin cancer, utilizing deep learning on large datasets of skin lesions. Studies have shown AI achieving diagnostic accuracy comparable to dermatologists [59] and even outperforming them in differentiating melanoma [60].
• Bone Diseases: The use of AI, particularly deep learning, is gaining traction in the medical community for diagnosing and treating bone diseases. Recent applications focus on segmentation and classification of bone tumors and lesions in medical images. For instance, Zhan et al. [61] developed SEAGNET, a novel framework for segmenting malignant bone tumors. Yildiz Potter et al. [62] explored a multi-task learning approach for automated bone tumor segmentation and classification. Additionally, Ye et al. [63] investigated an ensemble multi-task deep learning framework for the detection, segmentation, and classification of bone tumors and infections using multi-parametric MRI. These studies highlight the potential of deep learning to significantly improve the accuracy and efficiency of diagnosing and treating bone diseases.
• Other Pathological Applications: AI has demonstrated superior performance in detecting lymph node metastasis of breast cancer [64] and detecting diabetes from fundus photographs [65] with high sensitivity and specificity. These applications underscore AI's potential in enhancing the accuracy and efficiency of medical imaging diagnosis, ultimately improving patient outcomes and healthcare delivery.
Overall, AI-CAD systems have shown remarkable potential in various medical imaging applications, from breast cancer screening to tuberculosis detection, eye disease diagnosis, skin cancer diagnosis, and other pathological conditions. By leveraging the power of deep learning and large datasets, these systems can augment and enhance human expertise, leading to improved diagnostic accuracy, efficiency, and accessibility in healthcare.
C. Recent Research Trends in Medical Image Classification and Cancer Statistics (2020-2024)
Recent statistics from representative journals, retrieved using keywords related to medical image classification, cover the latest advancements from 2020 to 2024 (Table XVI). In addition, the 2024 Cancer Statistics [74] indicate a 33% decrease in cancer deaths in the U.S. since 1991, attributed to reduced smoking, earlier detection, and improved treatments. However, the incidence of six major cancers continues to rise, with colorectal cancer becoming a leading cause of death among men under 50. Efforts like the Persistent Poverty Initiative aim to mitigate poverty's impact on cancer outcomes, emphasizing the need for increased investment in prevention and disparity reduction.
The report concludes with a projection of the top ten cancer types for new cases and deaths in the United States for 2024 (Fig. 8), underscoring the ongoing challenge and importance of advancements in medical imaging diagnosis.

TABLE XVI. FIVE-YEAR STATISTICS OF MEDICAL IMAGE CLASSIFICATION RESEARCH IN FOUR REPRESENTATIVE JOURNALS (2020-2024)

| # | Classes | Springer | ScienceDirect | IEEE | PubMed |
|---|---|---|---|---|---|
| 1 | cancer | 4064 | 3474 | 748 | 291 |
| 2 | brain | 3599 | 2984 | 523 | 112 |
| 3 | tumor | 2440 | 2789 | 436 | 103 |
| 4 | lesion | 2378 | 3035 | 286 | 81 |
| 5 | lung | 2374 | 2102 | 433 | 98 |
| 6 | breast | 2019 | 1815 | 309 | 110 |
| 7 | eye | 1979 | 1602 | 144 | 39 |
| 8 | COVID | 1894 | 1460 | 343 | 107 |
| 9 | skin | 1865 | 1647 | 241 | 71 |
| 10 | heart | 1547 | 1489 | 121 | 19 |
| 11 | AIDS | 964 | 738 | 30 | 3 |
| 12 | liver | 898 | 971 | 61 | 25 |
| 13 | bone | 847 | 938 | 83 | 32 |
| 14 | cardiac | 722 | 898 | 30 | 14 |
| 15 | prostate | 581 | 638 | 25 | 14 |
| 16 | kidney | 541 | 696 | 32 | 20 |
| 17 | tuberculosis | 471 | 321 | 49 | 4 |
| 18 | colorectal | 442 | 494 | 35 | 22 |
| 19 | malaria | 178 | 115 | 25 | 6 |

VI. CHALLENGES AND ADVANCEMENTS IN MIC
While MIC has experienced significant progress, challenges remain in data limitations, algorithm development, and healthcare integration. This section explores these challenges and proposes innovative solutions to advance the field.


A. Challenges and Solutions in MIC
1) Medical image data:
a) Limited labeled data: Transfer learning has shown promise in addressing the scarcity of labeled data. Kim et al. [75] provide a comprehensive review of transfer learning methods for MIC. Additionally, FSL, ZSL, and Med-VLMs have been explored as potential solutions, as mentioned in previous sections. Fig. 8 shows the projected top ten cancer types for new cases and deaths in the United States.
Fig. 8. Projected top ten cancer types for new cases and deaths in the United States for 2024, by gender [74].
b) Inter-class similarity and imbalanced datasets: Islam et al. [76] introduced CosSIF, a cosine similarity-based image filtering method for synthetic medical image datasets to improve accuracy when dealing with high inter-class similarity. For imbalanced datasets, Huynh et al. [77] propose a semi-supervised learning approach for MIC.
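A simple, widely used complement to such methods is cost-sensitive training, which reweights the loss by inverse class frequency so that rare classes are not ignored. A minimal sketch (PyTorch assumed; the class counts are invented for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical label counts for a skewed 3-class dataset (e.g., rare lesions).
class_counts = torch.tensor([900., 80., 20.])

# Inverse-frequency weights: rare classes contribute more to the loss.
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(16, 3)        # model outputs for a batch of 16
labels = torch.randint(0, 3, (16,))
loss = criterion(logits, labels)   # misclassified rare cases cost more
```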
c) Large image sizes and domain shift: Sreenivasulu and Varadarajan [78] present an efficient lossless ROI image compression technique to address computational challenges posed by large image dimensions. Guan and Liu [79] provide a survey on domain adaptation methods for medical image analysis, highlighting techniques to improve model generalizability across different datasets and populations.
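One family of domain-adaptation methods discussed in such surveys aligns feature statistics between source and target domains. Below is a minimal sketch of a linear-kernel maximum mean discrepancy (MMD) penalty, a generic alignment technique rather than the specific method of any cited work; all shapes are illustrative:

```python
import torch

def mmd_linear(source_feats, target_feats):
    """Linear-kernel MMD: squared distance between domain feature means.
    Adding this penalty to the task loss encourages domain-invariant features."""
    delta = source_feats.mean(dim=0) - target_feats.mean(dim=0)
    return delta.dot(delta)

src = torch.randn(32, 256)   # features from labeled source-domain images
tgt = torch.randn(32, 256)   # features from unlabeled target-domain images
penalty = mmd_linear(src, tgt)
# total_loss = task_loss + lambda_mmd * penalty  (lambda_mmd: tunable weight)
```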
2) Clinical data:
a) Data privacy, security, and accessibility: Kaissis et al. [80] discuss secure, privacy-preserving, and federated machine learning approaches in medical imaging, addressing crucial aspects of protecting patient data while improving access to clinical data.
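The federated idea reduces to a simple loop: each site trains locally, and only model parameters, never patient images, are shared and averaged. A toy sketch of federated averaging under these assumptions (PyTorch assumed; not the specific protocol of [80]):

```python
import copy
import torch
import torch.nn as nn

def federated_average(site_models):
    """FedAvg: average parameters from models trained at different hospitals.
    Only weights leave each site; the raw patient images never do."""
    global_model = copy.deepcopy(site_models[0])
    with torch.no_grad():
        for name, param in global_model.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name]
                                   for m in site_models])
            param.copy_(stacked.mean(dim=0))
    return global_model

# Three hypothetical sites with locally trained copies of the same architecture.
sites = [nn.Linear(10, 2) for _ in range(3)]
global_model = federated_average(sites)
```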
3) Practical application challenges:
a) Model interpretability: Alam et al. [81] explore LRP and Grad-CAM visualization techniques to interpret multi-label-multi-class pathology prediction using chest radiography, enhancing model interpretability.
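For intuition, Grad-CAM weights the last convolutional feature maps by the spatially pooled gradients of a target class score and keeps the positive part. A compact sketch (PyTorch/torchvision assumed; the backbone, target layer, and class index are illustrative, not the setup of [81]):

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()
feats, grads = {}, {}
layer = model.layer4                                  # last conv block
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)                       # stand-in image
score = model(x)[0, 1]                                # class-1 logit
score.backward()                                      # gradients w.r.t. features

w = grads["a"].mean(dim=(2, 3), keepdim=True)         # pooled gradient weights
cam = F.relu((w * feats["a"]).sum(dim=1))             # weighted feature maps
cam = F.interpolate(cam[None], size=(224, 224), mode="bilinear")[0]
# `cam` highlights the image regions most responsible for the class-1 score.
```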
b) Model validation: Ramezan et al. [82] evaluate sampling and cross-validation tuning strategies for regional-scale machine learning classification, ensuring model performance and generalizability.
c) Regulatory approval: Joshi and Bhandari [83] provide an updated landscape of FDA-approved AI/ML-enabled medical devices, offering insights into navigating regulatory requirements.
Future Directions and Research Opportunities:
This review highlights various challenges in medical image classification and presents potential solutions based on recent research. However, it is important to note that these solutions require further validation in specific clinical contexts. To advance the field, researchers should consider:
• Conducting comparative studies of different approaches to address each challenge.
• Validating the proposed solutions in diverse clinical settings and with larger datasets.
• Investigating the integration of multiple solutions to address complex, real-world scenarios in medical image classification.
• Exploring the ethical implications and potential biases of AI systems in healthcare.
B. Key Advancements in MIC Techniques
a) Transformers vs. CNNs: Evidence suggests Transformers like ViT and DeiT demonstrate promising results compared to traditional CNNs, especially in capturing global context and long-range dependencies.
b) Synergy of transformers and CNNs: Hybrid models like MedViT and TransMT-Net leverage the strengths of both architectures, achieving superior performance in classification and segmentation tasks.
c) Med-VLMs for multitask MIC: Integrating Med-VLMs into multitask learning frameworks improves performance by effectively aligning visual and textual information.
d) AI for tumor classification: AI models demonstrate impressive accuracy in distinguishing between benign and malignant tumors, with the potential to augment clinical decision-making.
e) FSL addresses the challenge of limited labeled data by enabling models to generalize effectively from a small number of examples. Studies have demonstrated that FSL can achieve high accuracy in tasks such as tumor detection with minimal data, highlighting its potential in clinical applications (see the sketch after this list).
f) ZSL tackles the issue of classifying unseen categories by leveraging semantic relationships. ZSL has shown promising results in identifying rare diseases and novel medical conditions, significantly aiding in early diagnosis and treatment planning.
g) XAI techniques enhance the interpretability and trustworthiness of MIC systems, making them more acceptable in clinical practice. Additionally, XAI contributes to optimizing model performance and accuracy by providing insights that allow for iterative model adjustments.
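The few-shot behavior described in item e) is often realized with prototypical networks: each class is summarized by the mean embedding of its few labeled examples, and queries are assigned to the nearest prototype. A minimal sketch (PyTorch assumed; the embeddings are random stand-ins for a real feature extractor):

```python
import torch

def prototypical_predict(support, support_labels, query, num_classes):
    """Nearest-prototype classification from a handful of labeled examples."""
    protos = torch.stack([support[support_labels == c].mean(dim=0)
                          for c in range(num_classes)])  # class prototypes
    dists = torch.cdist(query, protos)                   # Euclidean distances
    return dists.argmin(dim=1)                           # nearest prototype

# 2-way 5-shot episode in a hypothetical 64-d embedding space.
support = torch.randn(10, 64)               # 5 embeddings per class
labels = torch.tensor([0] * 5 + [1] * 5)
query = torch.randn(4, 64)                  # unlabeled query embeddings
pred = prototypical_predict(support, labels, query, num_classes=2)
```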


C. Summary
Addressing data challenges, refining algorithms, and ensuring responsible implementation are crucial for advancing MIC. The integration of Transformers, CNNs, Med-VLMs, and XAI techniques holds immense potential for improving healthcare delivery and patient outcomes. With the increasing focus on explainability and trustworthiness in AI models, further breakthroughs in MIC and its transformative impact on healthcare can be anticipated.

VII. CONCLUSION AND FUTURE DIRECTIONS
The paper outlines the development of medical image classification through three solution levels: basic, specific, and applied. It discusses traditional high-performance deep learning models and highlights the promising vision-language models that can explain predictions. The paper also emphasizes the potential of multimodal models combining clinical and paraclinical data for disease diagnosis and treatment. It notes the research community's growing interest in early prediction to reduce risks and the role of Explainable Artificial Intelligence in improving predictive results. The application of AI in Computer Vision for medical purposes consistently surpasses expectations, indicating a future focus on integrating AI advancements into diagnostic and treatment-related problems using multimodal data.

ACKNOWLEDGMENT
This research is supported by Dept. of Computer Vision and Cognitive Cybernetics, Faculty of Information Technology, University of Science, Vietnam National University-Ho Chi Minh City.

REFERENCES
[1] S. Zhang et al., "BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs," arXiv [cs.CV], 2023.
[2] O. Thawkar et al., "XrayGPT: Chest radiographs summarization using medical vision-language models," arXiv [cs.CV], 2023.
[3] C. Liu et al., "M-FLAG: Medical vision-language pre-training with frozen language models and Latent spAce Geometry optimization," arXiv [cs.CV], 2023.
[4] Q. Chen, X. Hu, Z. Wang, and Y. Hong, "MedBLIP: Bootstrapping language-image pre-training from 3D medical images and texts," arXiv [cs.CV], 2023.
[5] S. Bannur et al., "Learning to exploit temporal structure for biomedical vision-language processing," arXiv [cs.CV], 2023.
[6] Z. Wang, Q. Sun, B. Zhang, P. Wang, J. Zhang, and Q. Zhang, "PM2: A new prompting multi-modal model paradigm for few-shot medical image classification," arXiv [cs.CV], 2024.
[7] H. Luo, Z. Zhou, C. Royer, A. Sekuboyina, and B. Menze, "DeViDe: Faceted medical knowledge for improved medical vision-language pre-training," arXiv [cs.CV], 2024.
[8] Z. Wang, Z. Wu, D. Agarwal, and J. Sun, "MedCLIP: Contrastive learning from unpaired medical images and text," arXiv [cs.CV], 2022.
[9] E. Tiu, E. Talius, P. Patel, C. P. Langlotz, A. Y. Ng, and P. Rajpurkar, "Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning," Nat. Biomed. Eng., vol. 6, no. 12, pp. 1399–1406, 2022.
[10] C. Wu, X. Zhang, Y. Zhang, Y. Wang, and W. Xie, "MedKLIP: Medical knowledge enhanced language-image pre-training in radiology," arXiv [eess.IV], 2023.
[11] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv [cs.CV], 2014.
[12] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," arXiv preprint arXiv:1409.4842, 2014.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv [cs.CV], 2015.
[14] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv [cs.LG], 2019.
[15] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv [cs.LG], 2016.
[16] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," arXiv [stat.ML], 2017.
[17] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv [cs.CV], 2020.
[18] H. Touvron et al., "Training data-efficient image transformers & distillation through attention," in International Conference on Machine Learning, PMLR, 2021.
[19] J. Chen et al., "TransUNet: Transformers make strong encoders for medical image segmentation," arXiv [cs.CV], 2021.
[20] Y. Liu, H. Wang, Z. Chen, K. Huangliang, and H. Zhang, "TransUNet+: Redesigning the skip connection to enhance features in medical image segmentation," Knowl. Based Syst., vol. 256, no. 109859, p. 109859, 2022.
[21] A. Jamali, S. K. Roy, J. Li, and P. Ghamisi, "TransU-Net++: Rethinking attention gated TransU-Net for deforestation mapping," Int. J. Appl. Earth Obs. Geoinf., vol. 120, no. 103332, p. 103332, 2023.
[22] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why should I trust you?: Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[23] S. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," arXiv [cs.AI], 2017.
[24] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[25] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," arXiv [cs.CV], 2016.
[26] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, "Grad-CAM++: Improved visual explanations for deep convolutional networks," arXiv [cs.CV], 2017.
[27] M. Xue et al., "ProtoPFormer: Concentrating on prototypical parts in vision transformers for interpretable image recognition," arXiv [cs.CV], 2022.
[28] L. Yu and W. Xiang, "X-Pruner: eXplainable pruning for vision transformers," arXiv [cs.CV], 2023.
[29] S. M. Dipto, M. T. Reza, M. N. J. Rahman, M. Z. Parvez, P. D. Barua, and S. Chakraborty, "An XAI integrated identification system of white blood cell type using variants of vision transformer," in Lecture Notes in Networks and Systems, Cham: Springer Nature Switzerland, 2023, pp. 303–315.
[30] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Lecture Notes in Computer Science, Cham: Springer International Publishing, 2015, pp. 234–241.
[31] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: A nested U-Net architecture for medical image segmentation," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Cham: Springer International Publishing, 2018, pp. 3–11.
[32] S. Mishra, "Malaria parasite detection using efficient neural ensembles," J. Electron. Electromedical Eng. Med. Inform., vol. 3, no. 3, pp. 119–133, 2021.
[33] C. F. Chong, Y. Wang, B. Ng, W. Luo, and X. Yang, "Image projective transformation rectification with synthetic data for smartphone-captured chest X-ray photos classification," Comput. Biol. Med., vol. 164, p. 107277, 2023.
large-scale image recognition,” arXiv [cs.CV], 2014.


[34] H. Wang et al., "CCF-GNN: A unified model aggregating appearance, microenvironment, and topology for pathology image classification," IEEE Trans. Med. Imaging, vol. 42, no. 11, pp. 3179–3193, 2023.
[35] B. Wang et al., "GazeGNN: A gaze-guided graph neural network for chest X-ray classification," arXiv [cs.CV], 2023.
[36] F. Almalik, M. Yaqub, and K. Nandakumar, "Self-Ensembling Vision Transformer (SEViT) for robust medical image classification," arXiv [cs.CV], 2022.
[37] O. N. Manzari, H. Ahmadabadi, H. Kashiani, S. B. Shokouhi, and A. Ayatollahi, "MedViT: A robust vision transformer for generalized medical image classification," Comput. Biol. Med., vol. 157, no. 106791, p. 106791, 2023.
[38] M. Monajatipoor, M. Rouhsedaghat, L. H. Li, C.-C. Jay Kuo, A. Chien, and K.-W. Chang, "BERTHop: An effective vision-and-language model for chest X-ray disease diagnosis," in Lecture Notes in Computer Science, Cham: Springer Nature Switzerland, 2022, pp. 725–734.
[39] X. Zhang, C. Wu, Y. Zhang, W. Xie, and Y. Wang, "Knowledge-enhanced visual-language pre-training on chest radiology images," Nat. Commun., vol. 14, no. 1, p. 4542, 2023.
[40] Z. Lai, Z. Li, L. C. Oliveira, J. Chauhan, B. N. Dugger, and C.-N. Chuah, "CLIPath: Fine-tune CLIP with visual feature fusion for pathology image analysis towards minimizing data collection efforts," in 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023, pp. 2366–2372.
[41] Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz, "Contrastive learning of medical visual representations from paired images and text," in Proceedings of the 7th Machine Learning for Healthcare Conference, 2022, vol. 182, pp. 2–25.
[42] C. E. von Schacky et al., "Multitask deep learning for segmentation and classification of primary bone tumors on radiographs," Radiology, vol. 301, no. 2, pp. 398–406, 2021.
[43] S. Graham et al., "One model is all you need: Multi-task learning enables simultaneous histology image segmentation and classification," Med. Image Anal., vol. 83, no. 102685, p. 102685, 2023.
[44] L. Huang, X. Ye, M. Yang, L. Pan, and S. H. Zheng, "MNC-Net: Multi-task graph structure learning based on node clustering for early Parkinson's disease diagnosis," Comput. Biol. Med., vol. 152, no. 106308, p. 106308, 2023.
[45] S. Jiang, Q. Feng, H. Li, Z. Deng, and Q. Jiang, "Attention based multi-task interpretable graph convolutional network for Alzheimer's disease analysis," Pattern Recognit. Lett., vol. 180, pp. 1–8, 2024.
[46] S. Tang et al., "Transformer-based multi-task learning for classification and segmentation of gastrointestinal tract endoscopic images," Comput. Biol. Med., vol. 157, no. 106723, p. 106723, 2023.
[47] J. Tagnamas, H. Ramadan, A. Yahyaouy, and H. Tairi, "Multi-task approach based on combined CNN-transformer for efficient segmentation and classification of breast tumors in ultrasound images," Vis. Comput. Ind. Biomed. Art, vol. 7, no. 1, 2024.
[48] S.-C. Huang, L. Shen, M. P. Lungren, and S. Yeung, "GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 3942–3951.
[49] Q. Li et al., "Anatomical structure-guided medical vision-language pre-training," arXiv [cs.CV], 2024.
[50] W. Fan et al., "MeDSLIP: Medical Dual-Stream Language-Image Pre-training for fine-grained alignment," arXiv [cs.CV], 2024.
[51] B. Liu et al., "Improving medical vision-language contrastive pretraining with semantics-aware triage," IEEE Trans. Med. Imaging, vol. 42, no. 12, pp. 3579–3589, 2023.
[52] M. Y. Lu et al., "A visual-language foundation model for computational pathology," Nat. Med., vol. 30, no. 3, pp. 863–874, 2024.
[53] R. Wang et al., "ECAMP: Entity-centered context-aware medical vision language pre-training," arXiv [cs.CV], 2023.
[54] R. C. Mayo, D. Kent, L. C. Sen, M. Kapoor, J. W. T. Leung, and A. T. Watanabe, "Reduction of false-positive markings on mammograms: A retrospective comparison study using an artificial intelligence-based CAD," J. Digit. Imaging, vol. 32, no. 4, pp. 618–624, 2019.
[55] Y. Chang et al., "Artificial intelligence for breast cancer screening in mammography (AI-STREAM): Preliminary interim analysis of a prospective multicenter cohort study," 2024.
[56] K. Okada et al., "Applicability of artificial intelligence-based computer-aided detection (AI-CAD) for pulmonary tuberculosis to community-based active case finding," Trop. Med. Health, vol. 52, no. 1, 2024.
[57] V. Gulshan et al., "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs," JAMA, vol. 316, no. 22, p. 2402, 2016.
[58] R. Poplin et al., "Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning," Nat. Biomed. Eng., vol. 2, no. 3, pp. 158–164, 2018.
[59] A. Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[60] H. A. Haenssle et al., "Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists," Ann. Oncol., vol. 29, no. 8, pp. 1836–1842, 2018.
[61] X. Zhan et al., "An intelligent auxiliary framework for bone malignant tumor lesion segmentation in medical image analysis," Diagnostics (Basel), vol. 13, no. 2, p. 223, 2023.
[62] I. Yildiz Potter et al., "Automated bone tumor segmentation and classification as benign or malignant using computed tomographic imaging," J. Digit. Imaging, vol. 36, no. 3, pp. 869–878, 2023.
[63] Q. Ye et al., "Automatic detection, segmentation, and classification of primary bone tumors and bone infections using an ensemble multi-task deep learning framework on multi-parametric MRIs: a multi-center study," Eur. Radiol., 2023.
[64] B. Ehteshami Bejnordi et al., "Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer," JAMA, vol. 318, no. 22, p. 2199, 2017.
[65] M. D. Abràmoff, P. T. Lavin, M. Birch, N. Shah, and J. C. Folk, "Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices," NPJ Digit. Med., vol. 1, no. 1, p. 39, 2018.
[66] G. Zhu, B. Jiang, L. Tong, Y. Xie, G. Zaharchuk, and M. Wintermark, "Applications of deep learning to neuro-imaging techniques," Front. Neurol., vol. 10, p. 869, 2019.
[67] A. Apicella, L. Di Lorenzo, F. Isgrò, A. Pollastro, and R. Prevete, "Strategies to exploit XAI to improve classification systems," in Communications in Computer and Information Science, Cham: Springer Nature Switzerland, 2023, pp. 147–159.
[68] A. Apicella, S. Giugliano, F. Isgrò, A. Pollastro, and R. Prevete, "An XAI-based masking approach to improve classification systems," BEWARE@AI*IA, pp. 79–83, 2023.
[69] L. Dao and N. Q. Ly, "A comprehensive study on medical image segmentation using deep neural networks," Int. J. Adv. Comput. Sci. Appl., vol. 14, no. 3, 2023.
[70] T. Beyer et al., "What scans we will read: imaging instrumentation trends in clinical oncology," Cancer Imaging, vol. 20, no. 1, pp. 1–38, 2020.
[71] D. M. H. Nguyen et al., "LVM-Med: Learning large-scale self-supervised vision models for medical imaging via second-order graph matching," arXiv [cs.CV], 2023.
[72] M. Antonelli, A. Reinke, S. Bakas, K. Farahani, and M. Jorge Cardoso, "The Medical Segmentation Decathlon," Nat. Commun., vol. 13, no. 1, p. 4128, 2022.
[73] H. Fujita, "AI-based computer-aided diagnosis (AI-CAD): the latest review to read first," Radiol. Phys. Technol., vol. 13, no. 1, pp. 6–19, 2020.
[74] R. L. Siegel, A. N. Giaquinto, and A. Jemal, "Cancer statistics, 2024," CA Cancer J. Clin., vol. 74, no. 1, pp. 12–49, 2024.
[75] H. E. Kim, A. Cosa-Linan, N. Santhanam, M. Jannesari, M. E. Maros, and T. Ganslandt, "Transfer learning for medical image classification: a literature review," BMC Med. Imaging, vol. 22, no. 1, 2022.
284 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 15, No. 7, 2024

[76] M. Islam, H. Zunair, and N. Mohammed, “CosSIF: Cosine similarity- [81] M. U. Alam, J. R. Baldvinsson, and Y. Wang, “Exploring LRP and Grad-
based image filtering to overcome low inter-class variation in synthetic CAM visualization to interpret multi-label-multi-class pathology
medical image datasets,” arXiv [cs.CV], 2023. prediction using chest radiography,” in 2022 IEEE 35th International
[77] T. Huynh, A. Nibali, and Z. He, “Semi-supervised learning for medical Symposium on Computer-Based Medical Systems (CBMS), 2022, pp.
image classification using imbalanced training data,” arXiv [cs.CV], 258–263.
2021. [82] C. A. Ramezan, T. A. Warner, and A. E. Maxwell, “Evaluation of
[78] P. Sreenivasulu and S. Varadarajan, “An efficient lossless ROI image sampling and cross-validation tuning strategies for regional-scale
compression using wavelet-based modified region growing algorithm,” J. machine learning classification,” Remote Sens. (Basel), vol. 11, no. 2, p.
Intell. Syst., vol. 29, no. 1, pp. 1063–1078, 2019. 185, 2019.
[79] H. Guan and M. Liu, “Domain adaptation for medical image analysis: A [83] G. Joshi and M. Bhandari, “FDA approved Artificial Intelligence and
survey,” arXiv [cs.CV], 2021. Machine Learning (AI/ML)-Enabled Medical Devices: An updated 2022
landscape,” Research Square, 2022.
[80] G. A. Kaissis, M. R. Makowski, D. Rückert, and R. F. Braren, “Secure,
privacy-preserving and federated machine learning in medical imaging,”
Nat. Mach. Intell., vol. 2, no. 6, pp. 305–311, 2020.
