
A Survey of Large Language Models in Medicine:

Progress, Application, and Challenge


Hongjian Zhou1,*, Fenglin Liu1,*,†, Boyang Gu2,*, Xinyu Zou3,*, Jinfa Huang4,*, Jinge Wu5,
Yiru Li6 , Sam S. Chen7 , Peilin Zhou8 , Junling Liu9 , Yining Hua10 , Chengfeng Mao11 ,
Chenyu You12 , Xian Wu13 , Yefeng Zheng13 , Lei Clifton1 , Zheng Li14,† , Jiebo Luo4,† , David
A. Clifton1,15,†
* Core Contributors, ordered by a coin toss. † Corresponding Authors.
1 University of Oxford, 2 Imperial College London, 3 University of Waterloo,
4 University of Rochester, 5 University College London, 6 Western University,
7 University of Georgia, 8 Hong Kong University of Science and Technology (Guangzhou),
9 Alibaba, 10 Harvard T.H. Chan School of Public Health, 11 Massachusetts Institute of Technology,
12 Yale University, 13 Tencent, 14 Amazon, 15 Oxford-Suzhou Centre for Advanced Research

{hongjian.zhou@cs,fenglin.liu@eng,david.clifton@eng}.ox.ac.uk,
[email protected], {jhuang90@ur,jluo@cs}.rochester.edu

ABSTRACT

Large language models (LLMs), such as ChatGPT, have received substantial attention due to their capabilities for understanding
and generating human language. While there has been a burgeoning trend in research focusing on the employment of LLMs in
supporting different medical tasks (e.g., enhancing clinical diagnostics and providing medical education), a comprehensive
review of these efforts, particularly their development, practical applications, and outcomes in medicine, remains scarce.
Therefore, this review aims to provide a detailed overview of the development and deployment of LLMs in medicine, including
the challenges and opportunities they face. In terms of development, we provide a detailed introduction to the principles
of existing medical LLMs, including their basic model structures, number of parameters, and sources and scales of data
used for model development. It serves as a guide for practitioners in developing medical LLMs tailored to their specific
needs. In terms of deployment, we offer a comparison of the performance of different LLMs across various medical tasks,
and further compare them with state-of-the-art lightweight models, aiming to provide a clear understanding of the distinct
advantages and limitations of LLMs in medicine. Overall, in this review, we address the following questions: 1) What
are the practices for developing medical LLMs? 2) How can the performance of medical LLMs be measured in a medical
setting? 3) How have medical LLMs been employed in real-world practice? 4) What challenges arise from the use of
medical LLMs? and 5) How can medical LLMs be more effectively developed and deployed? By answering these questions, this
review aims to provide insights into the opportunities and challenges of LLMs in medicine and serve as a practical resource
for constructing effective medical LLMs. We also maintain a regularly updated list of practical guides on medical LLMs at:
https://github.com/AI-in-Health/MedLLMsPracticalGuide

1 Introduction
The recently emerged general large language models (LLMs) 1,2 , such as PaLM 3 , LLaMA 4,5 , GPT-series 6,7 , and ChatGLM 8,9 ,
have advanced the state-of-the-art in various natural language processing (NLP) tasks, including text generation, text summa-
rization, and question answering. Inspired by these successes, several endeavors have been made to adapt general LLMs to the
medical domain, leading to the emergence of medical LLMs 10,11 . For example, based on PaLM 3 and GPT-4 7 , MedPaLM-2 11
and MedPrompt 12 have achieved competitive accuracies of 86.5% and 90.2%, respectively, compared to human experts (87.0% 13 )
in the United States Medical Licensing Examination (USMLE) 14 . In particular, based on publicly available general LLMs
(e.g. LLaMA 4,5 ), a wide range of medical LLMs, including ChatDoctor 15 , MedAlpaca 16 , PMC-LLaMA 13 , BenTsao 17 , and
Clinical Camel 18 , have been introduced. As a result, medical LLMs have attracted growing research interest for assisting
medical professionals in improving patient care 19,20 .
Although existing medical LLMs have achieved promising results, there are some key issues in their development and
application that need to be addressed. First, many of these models primarily focus on medical dialogue and medical question-
answering tasks, but their practical utility in clinical practice is often overlooked 19 . Recent research and reviews 19,21,22 have
begun to explore the potential of medical LLMs in different clinical scenarios, including Electronic Health Records (EHRs) 23 ,
discharge summary generation 20 , health education 24 , and care planning 11 . However, they primarily focus on presenting clinical
applications of LLMs, especially online commercial LLMs such as ChatGPT (including GPT-3.5 and GPT-4 7 ), without providing
practical guidelines for the development of medical LLMs. Moreover, they mainly perform case studies, conducting human
evaluation on a small number of samples, and thus lack evaluation datasets for assessing model performance in clinical scenarios.
Second, most existing medical LLMs report their performance mainly on answering medical questions, neglecting other
biomedical domains, such as medical language understanding and generation. These research gaps motivate the present review,
which offers a comprehensive overview of the development of LLMs and their applications in medicine. We aim to cover
existing medical LLMs, various medical tasks, clinical applications, and the challenges that arise.

Figure 1. An overview of the practical guides for medical large language models. [Figure body omitted. The figure maps the structure of this review: principles and development pipeline (Section 2: pre-training from scratch, fine-tuning general LLMs, and prompting general LLMs); medical tasks (Section 3: discriminative and generative tasks, with performance comparisons); clinical applications (Section 4: medical education, medical diagnosis, mental health support, clinical report generation, medical robotics, medical language translation, and formatting and ICD coding); challenges (Section 5: hallucination, lack of evaluation benchmarks and metrics, new knowledge adaptation, behavior alignment, domain data limitations, and ethical, legal, and safety concerns); and future directions (Section 6: new benchmarks, multi-modal LLM agents, and interdisciplinary collaborations).]
As shown in Figure 1, this review seeks to answer the following questions. Section 2: What are LLMs? How can medical
LLMs be effectively built? Section 3: How are the current medical LLMs evaluated? What capabilities do medical LLMs offer
beyond traditional models? Section 4: How should medical LLMs be applied in clinical settings? Section 5: What challenges
should be addressed when implementing medical LLMs in clinical practice? Section 6: How can we optimize the construction
of medical LLMs to enhance their applicability in clinical settings, ultimately contributing to medicine and creating a positive
societal impact?
For the first question, we analyze the foundational principles underpinning current medical LLMs, providing detailed
descriptions of their architecture, parameter scales, and the datasets used during their development. This exposition aims to
serve as a valuable resource for researchers and clinicians designing medical LLMs tailored to specific requirements, such as
computational constraints, data privacy concerns, and the integration of local knowledge bases. For the second question, we
evaluate the performance of medical LLMs across ten biomedical NLP tasks, encompassing both discriminative and generative
tasks. This comparative analysis elucidates how these models outperform traditional AI approaches, offering insights into
the specific capabilities that render LLMs effective in clinical environments. The third question, the practical deployment
of medical LLMs in clinical settings, is explored through the development of guidelines tailored for seven distinct clinical
application scenarios. This section outlines practical implementations, emphasizing specific functionalities of medical LLMs
that are leveraged in each scenario. The fourth question emphasizes addressing the challenges associated with the clinical
deployment of medical LLMs, such as the risk of generating factually inaccurate yet plausible outputs (hallucination), and
the ethical, legal, and safety implications. Citing recent studies, we argue for a comprehensive evaluation framework that
assesses the trustworthiness of medical LLMs to ensure their responsible and effective utilization in healthcare. For the last
question, we propose future research directions to advance the medical LLMs field. This includes fostering interdisciplinary
collaboration between AI specialists and medical professionals, advocating for a ’doctor-in-the-loop’ approach, and emphasizing
human-centered design principles.
By establishing robust training data, benchmarks, metrics, and deployment strategies through co-development efforts, we
aim to accelerate responsible and efficacious integration of medical LLMs into clinical practice. This study therefore seeks
to stimulate continued research and development in this interdisciplinary field, with the objective of realizing the profound
potential of medical LLMs in enhancing clinical practice and advancing medical science for the betterment of society.

BOX 1: Background of Large Language Models (LLMs)
The impressive performance of LLMs can be attributed to Transformer-based models, large-scale pre-training, and scaling laws.

Language Models A language model 25,26,27 is a probabilistic model of the joint probability distribution of tokens (meaningful units of text, such as words, subwords, or morphemes) in a sequence, i.e., the probabilities of how words and phrases are used in sequences. Therefore, it can predict the likelihood of a sequence of tokens given the previous tokens, which can be used to predict the next token in a sequence or to generate new sequences.

The Transformer architecture The recurrent neural network (RNN) 28,26 has been widely used for language modeling by processing tokens sequentially and maintaining a vector, named the hidden state, that encodes the context of previous tokens. Nonetheless, sequential processing makes it unsuitable for parallel training and limits its ability to capture long-range dependencies, making it computationally expensive and hindering its learning ability for long sequences. The strength of the Transformer 29 lies in its fully attentive mechanism, which relies exclusively on attention and eliminates the need for recurrence. When processing each token, the attention mechanism computes a weighted sum of the other input tokens, where the weights are determined by the relevance between each input token and the current token. This allows the model to adaptively focus on different parts of the sequence to effectively learn the joint probability distribution of tokens. Therefore, the Transformer not only enables efficient modeling of long text but also allows highly parallel training 30 , thus reducing training costs. These properties make the Transformer highly scalable, so it is efficient to obtain LLMs through the large-scale pre-training strategy.

Large-scale Pre-training LLMs are trained on massive corpora of unlabeled text (e.g., CommonCrawl, Wiki, and Books) to learn rich linguistic knowledge and language patterns. The common training objectives are masked language modeling (MLM) and next token prediction (NTP). In MLM, a portion of the input text is masked, and the model is tasked with predicting the masked text based on the remaining unmasked context, encouraging the model to capture the semantic and syntactic relationships between tokens 30 . In NTP, the model is required to predict the next token in a sequence given all the previous tokens 6 .

Scaling Laws LLMs are scaled-up versions of the Transformer architecture 29 with increased numbers of Transformer layers, model parameters, and volumes of pre-training data. The "scaling laws" 31,32 predict how much improvement can be expected in a language model's performance as its size increases (in terms of parameters, layers, data, or the amount of training compute). The scaling laws proposed by OpenAI 31 show that, to achieve optimal model performance, the budget allocation for model size should be larger than that for data. The scaling laws proposed by Google DeepMind 32 show that model and data sizes should be increased in equal proportions. Scaling laws guide researchers in allocating resources and anticipating the benefits of scaling models.

General Large Language Models Existing general LLMs can be divided into three categories based on their architecture (Table 1).

Encoder-only LLMs, consisting of a stack of Transformer encoder layers, employ a bidirectional training strategy that allows them to integrate context from both the left and the right of a given token in the input sequence. This bi-directionality enables the models to achieve a deep understanding of the input sentences 30 . Therefore, encoder-only LLMs are particularly suitable for language understanding tasks (e.g., sentiment analysis, document classification) where the full context of the input is essential for accurate predictions. BERT 30 and DeBERTa 33 are representative encoder-only LLMs.

Decoder-only LLMs utilize a stack of Transformer decoder layers and are characterized by their uni-directional (left-to-right) processing of text, enabling them to generate language sequentially. This architecture is trained with the next token prediction objective: predicting the next token in a sequence given all the previous tokens. After training, decoder-only LLMs generate sequences autoregressively (i.e., token by token). Examples include the GPT-series developed by OpenAI 6,7 , the LLaMA-series developed by Meta 4,5 , and PaLM 3 and Bard (Gemini) 34 developed by Google. Based on the LLaMA model, Alpaca 35 is fine-tuned with 52k self-instructed data supervision. In addition, Baichuan 36 is trained on approximately 1.2 trillion tokens and supports bilingual communication in Chinese and English. These LLMs have been used successfully for language generation.

Encoder-decoder LLMs are designed to process input sequences and generate output sequences. They consist of a stack of bidirectional Transformer encoder layers followed by a stack of unidirectional Transformer decoder layers. The encoder processes and understands the input sequences, while the decoder generates the output sequences 8,9,37 . Representative examples of encoder-decoder LLMs include Flan-T5 38 and ChatGLM 8,9 . Specifically, ChatGLM 8,9 has 6.2B parameters and is a conversational open-source LLM specially optimized for Chinese, supporting Chinese-English bilingual question answering.
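To make the two pre-training objectives concrete, they can be written in standard notation (this formulation is generic, not specific to any model surveyed here). A language model factorizes the joint probability of a token sequence autoregressively, and NTP and MLM minimize the corresponding negative log-likelihoods:

```latex
% Autoregressive factorization of a token sequence x = (x_1, ..., x_T):
p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})

% Next token prediction (NTP) minimizes the negative log-likelihood:
\mathcal{L}_{\mathrm{NTP}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

% Masked language modeling (MLM) reconstructs a masked subset M of positions
% from the remaining unmasked context x_{\setminus M}:
\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{t \in M} \log p_\theta(x_t \mid x_{\setminus M})
```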

Table 1. Summary of existing general (large) language models, their underlying structures, numbers of parameters,
and datasets used for model training. Column “# params” shows the number of parameters, M: million, B: billion.

All models below are general-domain (large) language models; "-" denotes not reported.

Encoder-only. Columns: Models | # Params | Pre-train Data Scale
BERT 30 | 110M/340M | 3.3B tokens
RoBERTa 39 | 355M | 161GB
DeBERTa 33 | 1.5B | 160GB

Decoder-only
GPT-2 40 | 1.5B | 40GB
Vicuna 41 | 7B/13B | LLaMA + 70K dialogues
Alpaca 35 | 7B/13B | LLaMA + 52K IFT
Mistral 42 | 7B | -
LLaMA 4 | 7B/13B/33B/65B | 1.4T tokens
LLaMA-2 5 | 7B/13B/34B/70B | 2T tokens
LLaMA-3 43 | 8B/70B | 15T tokens
GPT-3 6 | 6.7B/13B/175B | 300B tokens
Qwen 44 | 1.8B/7B/14B/72B | 3T tokens
PaLM 3 | 8B/62B/540B | 780B tokens
FLAN-PaLM 37 | 540B | -
Gemini (Bard) 34 | - | -
GPT-3.5 45 | - | -
GPT-4 7 | - | -
Claude-3 46 | - | -

Encoder-Decoder
BART 47 | 140M/400M | 160GB
ChatGLM 8,9 | 6.2B | 1T tokens
T5 38 | 11B | 1T tokens
FLAN-T5 37 | 3B/11B | 780B tokens
UL2 48 | 19.5B | 1T tokens
GLM 9 | 130B | 400B tokens

2 The Principles of Medical Large Language Models


Box 1 and Table 1 briefly introduce the background of general LLMs 1 , e.g., GPT-4 7 . Table 2 summarizes the currently
available medical LLMs according to their model development. Existing medical LLMs are mainly pre-trained from scratch,
fine-tuned from existing general LLMs, or directly obtained through prompting to align the general LLMs to the medical
domain. Therefore, we introduce the principles of medical LLMs in terms of these three methods: pre-training, fine-tuning, and
prompting. Meanwhile, we further summarize the medical LLMs according to their model architectures in Figure 2.

Table 2. Summary of existing medical-domain LLMs, in terms of their model development, the number of parameters (#
params), the scale of pre-training/fine-tuning data, and the data source. M: million, B: billion.
 
All models below are medical-domain LLMs (Sec. 2); "-" denotes not reported.

Pre-training (Sec. 2.1). Columns: Models | # Params | Data Scale | Data Source
BioBERT 49 | 110M | 18B tokens | PubMed 50 + PMC 51
PubMedBERT 52 | 110M/340M | 3.2B tokens | PubMed 50 + PMC 51
SciBERT 53 | 110M | 3.17B tokens | Literature 54
ClinicalBERT 55 | 110M | 112k clinical notes | MIMIC-III 56
BioM-ELECTRA 57 | 110M/335M | - | PubMed 50
BioMed-RoBERTa 58 | 125M | 7.55B tokens | S2ORC 59
BioLinkBERT 60 | 110M/340M | 21GB | PubMed 50
SciFive 61 | 220M/770M | - | PubMed 50 + PMC 51
ClinicalT5 62 | 220M/770M | 2M clinical notes | MIMIC-III 56
BlueBERT 63,64,65 | 110M/340M | >4.5B tokens | PubMed 50 + MIMIC-III 56
MedCPT 66 | 330M | 255M articles | PubMed 50
BioGPT 67 | 1.5B | 15M articles | PubMed 50
BioMedLM 68 | 2.7B | 110GB | Pile 69
OphGLM 70 | 6.2B | 20k dialogues | MedDialog 71
GatorTron 23 | 8.9B | >82B + 6B + 2.5B + 0.5B tokens | EHRs 23 + PubMed 50 + Wiki + MIMIC-III 56
GatorTronGPT 72 | 5B/20B | 277B tokens | EHRs 72

Fine-tuning (Sec. 2.2). Columns: Models | # Params | Data Scale | Data Source
DoctorGLM 73 | 6.2B | 323MB dialogues | CMD. 74
BianQue 75 | 6.2B | 2.4M dialogues | BianQueCorpus 75
ClinicalGPT 76 | 7B | 96k EHRs + 192 medical QA + 100k dialogues | MD-EHR 76 + VariousMedQA 14 + MedDialog 71
Qilin-Med 77 | 7B | 3GB | ChiMed 77
ChatDoctor 15 | 7B | 110k dialogues | HealthCareMagic 78 + iCliniq 79
BenTsao 17 | 7B | 8k instructions | CMeKG-8K 80
HuatuoGPT 81 | 7B | 226k instructions & dialogues | Hybrid SFT 81
Baize-healthcare 82 | 7B | 101k dialogues | Quora + MedQuAD 83
BioMedGPT 84 | 7B | >26B tokens | S2ORC 59
MedAlpaca 16 | 7B/13B | 160k medical QA | Medical Meadow 16
AlpaCare 85 | 7B/13B | 52k instructions | MedInstruct-52k 85
Zhongjing 86 | 13B | 70k dialogues | CMtMedQA 86
PMC-LLaMA 13 | 13B | 79.2B tokens | Books + Literature 59 + MedC-I 13
CPLLM 87 | 13B | 109k EHRs | eICU-CRD 88 + MIMIC-IV 89
OpenBioLLM 90 | 8B/70B | - | -
MEDITRON 91,92 | 7B/70B | 48.1B tokens | PubMed 50 + Guidelines 91
Clinical Camel 18 | 13B/70B | 70k dialogues + 100k articles + 4k medical QA | ShareGPT 93 + PubMed 50 + MedQA 14
MedPaLM 2 11 | 340B | 193k medical QA | MultiMedQA 11
Med-Gemini 94,95 | - | - | MedQA-R&RS 95 + MultiMedQA 11 + MIMIC-III 56 + MultiMedBench 96

Prompting (Sec. 2.3). Columns: Models | Base Model | Prompting Method | Data Source
CodeX 97 | GPT-3.5 / LLaMA-2 | Chain-of-Thought (CoT) 98 | -
DeID-GPT 99 | ChatGPT / GPT-4 | Chain-of-Thought (CoT) 98 | -
ChatCAD 100 | ChatGPT | Zero-shot Prompting | -
Dr. Knows 101 | ChatGPT | Zero-shot Prompting | UMLS 102
MedPaLM 10 | PaLM (540B) | 40 instructions | MultiMedQA 11
MedPrompt 12 | GPT-4 | Few-shot & CoT 98 | -
Chat-Orthopedist 103 | ChatGPT | Retrieval-Augmented Generation (RAG) | PubMed + Guidelines 104 + UpToDate 105 + DynaMed 106
QA-RAG 107 | ChatGPT | RAG | FDA QA 107
Almanac 108 | ChatGPT | RAG & CoT | Clinical QA 108

2.1 Pre-training
Pre-training typically involves training an LLM on a large corpus of medical texts, including both structured and unstructured
text, to learn the rich medical knowledge. The corpus may include EHRs 72 , clinical notes 23 , and medical literature 55 . In
particular, PubMed 50 , MIMIC-III clinical notes 56 , and PubMed Central (PMC) literature 51 , are three widely used medical
corpora for medical LLM pre-training. A single corpus or a combination of corpora may be used for pre-training. For example,
PubMedBERT 52 and ClinicalBERT are pre-trained on PubMed and MIMIC-III, respectively. In contrast, BlueBERT 63
combines both corpora for pre-training; BioBERT 49 is pre-trained on both PubMed and PMC. The University of Florida (UF)
Health EHRs are further introduced in pre-training GatorTron 23 and GatorTronGPT 72 . MEDITRON 91 is pre-trained on Clinical
Practice Guidelines (CPGs). The CPGs are used to guide healthcare practitioners and patients in making evidence-based
decisions about diagnosis, treatment, and management.
To meet the needs of the medical domain, pre-training medical LLMs typically involves refining the commonly used training
objectives of general LLMs: masked language modeling, next sentence prediction, and next token prediction (see Box 1 for an
introduction to these pre-training objectives). For example, BERT-series models (e.g., BioBERT 49 , PubMedBERT 52 ,
ClinicalBERT 55 , and GatorTron 23 ) mainly adopt masked language modeling and next sentence prediction for pre-training,
whereas GPT-series models (e.g., BioGPT 67 and GatorTronGPT 72 ) mainly adopt next token prediction for pre-training.

[Figure 2 body omitted: a chart of medical LLM model sizes, ranging from roughly 0.1B to 540B parameters, grouped by architecture family.]

Figure 2. We adopt the data from Table 2 to demonstrate the development of model sizes for medical large language models
in different model architectures, i.e., BERT-like, ChatGLM/LLaMA-like, and GPT/PaLM-like.

It is worth mentioning that BERT-like medical LLMs (e.g., BioBERT 49 , PubMedBERT 52 , ClinicalBERT 55 ) are originally
derived from the general-domain BERT or RoBERTa models. To clarify the differences between models, Table 2 shows only
the data sources used to further construct each medical LLM. After pre-training, medical LLMs acquire rich medical knowledge
that can be leveraged to achieve strong performance on different medical tasks.
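As an illustration of this recipe, the following is a minimal sketch of continued domain-adaptive pre-training with the MLM objective using the Hugging Face Transformers library. It is not the exact setup of any model in Table 2; the base checkpoint, hyperparameters, and the corpus file name are placeholder assumptions.

```python
# Minimal sketch: continue pre-training a general encoder-only model on a
# medical corpus with masked language modeling (MLM).
# "medical_corpus.txt" is a hypothetical plain-text file of medical documents.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Load and tokenize the raw medical text.
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens; the model must reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medical-mlm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```

GPT-style medical LLMs follow the same pattern, with a causal (next-token-prediction) head and collator in place of MLM.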

2.2 Fine-tuning
Training a medical LLM from scratch is costly and time-consuming, requiring substantial computational power (e.g., several
days or even weeks of training) and manual labor. One solution is to fine-tune general LLMs with medical
data, and researchers have proposed different fine-tuning methods 11,16,18 for learning domain-specific medical knowledge and
obtaining medical LLMs. Current fine-tuning methods include Supervised Fine-Tuning (SFT), Instruction Fine-Tuning (IFT),
and Parameter-Efficient Fine-Tuning (PEFT). The resulting fine-tuned medical LLMs are summarized in Table 2.
Supervised Fine-Tuning (SFT) leverages high-quality medical corpora, such as physician-patient conversations 15 ,
medical question answering 16 , and knowledge graphs 77,17 . The constructed SFT data serves as a continuation of the pre-training
data to further pre-train the general LLMs with the same training objectives, e.g. next token prediction. SFT provides an
additional pre-training phase that allows the general LLMs to learn rich medical knowledge and align with the medical domain,
thus transforming them into specialized medical LLMs.
The diversity of SFT enables the development of diverse medical LLMs by training on different types of medical corpora.
For example, DoctorGLM 73 and ChatDoctor 15 are obtained by fine-tuning the general LLMs ChatGLM 8,9 and LLaMA 4 on
the physician-patient dialogue data, respectively. MedAlpaca 16 based on the general LLM Alpaca 35 is fine-tuned using over
160,000 medical QA pairs sourced from diverse medical corpora. Clinical Camel 18 combines physician-patient conversations,
clinical literature, and medical QA pairs to refine the LLaMA-2 model 5 . In particular, Qilin-Med 77 and Zhongjing 86 are
obtained by incorporating the knowledge graph to perform fine-tuning on the Baichuan 36 and LLaMA 4 , respectively.
In summary, existing studies have demonstrated the efficacy of SFT in adapting general LLMs to the medical domain. They
show that SFT improves not only the model’s capability for understanding and generating medical text, but also its ability to
provide accurate clinical decision support 109 .
Instruction Fine-Tuning (IFT) constructs instruction-based training datasets 110,109,1 , which typically comprise instruction-
input-output triples, e.g. instruction-question-answer. The primary goal of IFT is to enhance the model’s ability to follow
various human/task instructions, align their outputs with the medical domain, and thereby produce a specialized medical LLM.
Thus, the main difference between SFT and IFT is that the former focuses primarily on injecting medical knowledge into a
general LLM through continued pre-training, thus improving its ability to understand the medical text and accurately predict
the next token. In contrast, IFT aims to improve the model’s instruction following ability and adjust its outputs to match the
given instructions, rather than accurately predicting the next token as in SFT 110 . As a result, SFT emphasizes the quantity
of training data, while IFT emphasizes its quality and diversity. Since IFT and SFT are both capable of improving model
performance, there have been some recent works 86,77,85 attempting to combine them for obtaining robust medical LLMs.

In other words, to enhance the performance of LLMs through IFT, it is essential to ensure that the training data for IFT are
of high quality and encompass a wide range of medical instructions and medical scenarios. To this end, MedPaLM-2 11 invited
qualified medical professionals to develop the instruction data for fine-tuning the general PaLM. BenTsao 17 and ChatGLM-
Med 111 constructed the knowledge-based instruction data from the knowledge graph. Zhongjing 86 further incorporated the
multi-turn dialogue as the instruction data to perform IFT. MedAlpaca 16 simultaneously incorporated the medical dialogues
and medical QA pairs for instruction fine-tuning.
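For illustration, an IFT training record typically takes the instruction-input-output form described above. The example triple below is ours, modeled on that format rather than drawn from MedInstruct-52k, Medical Meadow, or any other cited dataset.

```python
# An illustrative instruction-input-output triple of the kind used for IFT;
# the content is a made-up example, not a record from any cited dataset.
ift_example = {
    "instruction": "Answer the patient's question concisely and name the "
                   "relevant condition.",
    "input": "I was prescribed metformin. What is it for?",
    "output": "Metformin is a first-line medication for type 2 diabetes; "
              "it lowers blood glucose by reducing hepatic glucose production.",
}

# During IFT, the triple is flattened into one training sequence; the loss is
# usually computed only on the response tokens, not on the prompt tokens.
prompt = (f"### Instruction:\n{ift_example['instruction']}\n"
          f"### Input:\n{ift_example['input']}\n### Response:\n")
target = ift_example["output"]
```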
Parameter-Efficient Fine-Tuning (PEFT) aims to substantially reduce computational and memory requirements for fine-
tuning general LLMs. The main idea is to keep most of the parameters of the pre-trained LLM unchanged, fine-tuning only a
small subset of parameters (or a small set of additional parameters). Commonly used PEFT techniques include Low-Rank
Adaptation (LoRA) 112 , Prefix Tuning 113 , and Adapter Tuning 114,115 .
In contrast to fine-tuning full-rank weight matrices, 1) LoRA preserves the parameters of the original LLMs and only adds
trainable low-rank matrices into the self-attention module of each Transformer layer 112 . Therefore, LoRA can substantially
reduce the number of trainable parameters and improve the efficiency of fine-tuning, while still enabling the fine-tuned LLM to
effectively capture the characteristics of the tasks. 2) Prefix Tuning takes a different approach from LoRA by adding a small
set of continuous task-specific vectors (i.e. “prefixes”) to the input of each Transformer layer 113,1 . These prefixes serve as
the additional context to guide the generation of the model without changing the original pre-trained parameter weights. 3)
Adapter Tuning involves introducing small neural network modules, known as adapters, into each Transformer layer of the
pre-trained LLMs 116 . These adapters are fine-tuned while keeping the original model parameters frozen 116 , thus allowing for
flexible and efficient fine-tuning. The number of trainable parameters introduced by adapters is relatively small, yet they enable
the LLMs to adapt to clinical scenarios or tasks effectively.
In general, PEFT is valuable for developing LLMs that meet unique needs in specific (e.g. medical) domains, due
to its ability to reduce computational demands while maintaining the model performance. For example, medical LLMs
DoctorGLM 73 , MedAlpaca 16 , Baize-Healthcare 82 , Zhongjing 86 , CPLLM 87 , and Clinical Camel 18 adopted the LoRA 112 to
perform parameter-efficient fine-tuning to efficiently align the general LLMs to the medical domain.
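As a minimal sketch of how LoRA is applied in practice, the snippet below uses the Hugging Face `peft` library; the base model and hyperparameters are illustrative assumptions, not the configuration of any specific model in Table 2.

```python
# Minimal sketch: LoRA-based parameter-efficient fine-tuning.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Inject trainable low-rank matrices into the attention projections only;
# all original model weights remain frozen.
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

The wrapped model can then be trained on medical SFT or IFT data exactly as in full fine-tuning, at a fraction of the memory cost.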

2.3 Prompting
Fine-tuning considerably reduces computational costs compared to pre-training, but it still requires further model training and
the collection of high-quality fine-tuning datasets, thus consuming some computational resources and manual labor. In
contrast, the “prompting” methods efficiently align general LLMs (e.g. PaLM 3 ) to the medical domain (e.g., MedPaLM 10 ),
without training any model parameters. Popular prompting methods include In-Context Learning (ICL), Chain-of-Thought
(CoT) prompting, Prompt Tuning, and Retrieval-Augmented Generation (RAG).
In-Context Learning (ICL) directly gives instructions to prompt the LLM to perform a task efficiently. In general,
ICL consists of four processes: task understanding, context learning, knowledge reasoning, and answer generation. First, the
model must understand the specific requirements and goals of the task. Second, the model learns the contextual
information related to the task from the provided context. Then, the model uses its internal knowledge and reasoning capabilities
to identify the patterns and logic in the examples. Finally, the model generates the task-related answers. The advantage of ICL is
that it does not require a large amount of labeled data for fine-tuning. Based on the type and number of input examples, ICL can
be divided into three categories 117 : 1) one-shot prompting, where a single example is provided together with the task description;
2) few-shot prompting, where multiple examples are provided together with the task description; and 3) zero-shot prompting,
where only the task description is provided. ICL lets the LLM make task predictions based on contexts augmented with a few
examples and task demonstrations, allowing it to learn from these examples or demonstrations to accurately perform
the task and produce corresponding answers 6 . Therefore, ICL allows LLMs to accurately understand
and respond to medical queries. For example, MedPaLM 10 substantially improves the task performance by providing the
general LLM, PaLM 3 , with a small number of task examples such as medical QA pairs.
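As an illustration, a few-shot ICL prompt simply prepends worked examples to the query. The prompt below is a generic example of the technique, written by us, not a prompt used by MedPaLM or any other cited system.

```python
# An illustrative few-shot (2-shot) in-context learning prompt for medical QA;
# the examples and wording are made up for demonstration purposes.
few_shot_prompt = """Answer the medical question with a single option letter.

Q: Which vitamin deficiency causes scurvy? (A) Vitamin C (B) Vitamin D
A: A

Q: Which organ produces insulin? (A) Liver (B) Pancreas
A: B

Q: Which electrolyte disturbance is most associated with peaked T waves on ECG?
(A) Hypokalemia (B) Hyperkalemia
A:"""
```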
Chain-of-Thought (CoT) Prompting further improves the accuracy and logic of model outputs compared with in-context
learning. Specifically, through its prompt wording, CoT prompts the model to generate intermediate reasoning steps or paths
when dealing with (complex) downstream problems 98 . Moreover, CoT can be combined with few-shot prompting by
giving reasoning examples, thus enabling medical LLMs to give reasoning processes when generating responses. For tasks
involving complex reasoning, such as medical QA, CoT has been shown to effectively improve model performance 10,11 . Medical
LLMs, such as DeID-GPT 99 , MedPaLM 10 , and MedPrompt 12 , use CoT prompting to assist them in simulating a diagnostic
thought process, thus providing more transparent and interpretable predictions or diagnoses. In particular, MedPrompt 12
directly prompts a general LLM, GPT-4 7 , to outperform the fine-tuned medical LLMs on medical QA without training any
model parameters.
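For illustration, a CoT prompt elicits intermediate reasoning before the final answer. The example below is again generic; it is not the exact prompt used by DeID-GPT, MedPaLM, or MedPrompt.

```python
# An illustrative chain-of-thought prompt for a medical QA item.
cot_prompt = (
    "A 65-year-old man presents with crushing chest pain radiating to the "
    "left arm and diaphoresis. Which diagnosis is most likely? "
    "(A) GERD (B) Myocardial infarction (C) Costochondritis\n"
    "Let's think step by step: list the key findings, weigh each option "
    "against them, and then state the final answer."
)
```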

Prompt Tuning aims to improve the model performance by employing both prompting and fine-tuning techniques 118,115 . The
prompt tuning method introduces learnable prompts, i.e. trainable continuous vectors, which can be optimized or adjusted
during the fine-tuning process to better adapt to different medical scenarios and tasks. Therefore, they provide a more flexible
way of prompting LLMs than the “prompting alone” methods that use discrete and fixed prompts, as described above. In contrast
to traditional fine-tuning methods that train all model parameters, prompt tuning only tunes a very small set of parameters
associated with the prompts themselves, instead of extensively training the model parameters. Thus, prompt tuning effectively
and accurately responds to medical problems 12 , while incurring minimal computational cost.
Existing medical LLMs that employ prompting techniques are listed in Table 2. Recently, MedPaLM 10 and MedPaLM-2 11
proposed combining all the above prompting methods, resulting in instruction prompt tuning, to achieve strong performance
on various medical question-answering datasets. In particular, on the MedQA dataset drawn from the US Medical Licensing
Examination (USMLE), MedPaLM-2 11 achieves a competitive overall accuracy of 86.5% compared to human experts (87.0%),
surpassing the previous state-of-the-art method MedPaLM 10 by a large margin (19%).
Retrieval-Augmented Generation (RAG) enhances the performance of LLMs by integrating external knowledge into the
generation process. In detail, RAG can be used to minimize LLM’s hallucinations, obscure reasoning processes, and reliance
on outdated information by incorporating external database knowledge 119 . RAG consists of three main components: retrieval,
augmentation, and generation. The retrieval component employs various indexing strategies and input query processing
techniques to search for and retrieve the top-ranked relevant information from an external knowledge base. The retrieved external data is then
augmented into the LLM’s prompt, providing additional context and grounding for the generated response. By directly updating
the external knowledge base, RAG mitigates the risk of catastrophic forgetting associated with model weight modifications,
making it particularly suitable for domains with low error tolerance and rapidly evolving information, such as the medical
field. In contrast to traditional fine-tuning methods, RAG enables the timely incorporation of new medical information without
compromising the model’s previously acquired knowledge, ensuring the generated outputs remain accurate and up-to-date
in the face of evolving medical challenges. Most recently, researchers proposed the first benchmark MIRAGE 120 based on
medical information RAG, including 7,663 questions from five medical QA datasets, which has been established to both steer
research and facilitate the practical deployment of medical RAG systems.
In RAG, retrieval can be achieved by calculating the similarity between the embeddings of the question and document
chunks, where the semantic representation capability of embedding models plays a key role. Recent research has introduced
prominent embedding models such as AngIE 121 , Voyage 122 , and BGE 123 . In addition to embedding, the retrieval process
can be optimized via various strategies such as adaptive retrieval, recursive retrieval, and iterative retrieval 124,125,126 . Several
recent works have demonstrated the effectiveness of RAG in medical and pharmaceutical domains. Almanac 108 is a large
language framework augmented with retrieval capabilities for medical guidelines and treatment recommendations, surpassing
the performance of ChatGPT on clinical scenario evaluations, particularly in terms of completeness and safety. Another work
QA-RAG 107 employs RAG with LLM for pharmaceutical regulatory tasks, where the model searches for relevant guideline
documents and provides answers based on the retrieved guidelines. Chat-Orthopedist 103 , a retrieval-augmented LLM, assists
adolescent idiopathic scoliosis (AIS) patients and their families in preparing for meaningful discussions with clinicians by
providing accurate and comprehensible responses to patient inquiries, leveraging AIS domain knowledge.
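To make the retrieve-augment-generate loop concrete, the sketch below implements it with plain cosine-similarity retrieval. Here `embed` and `llm_generate` are hypothetical stand-ins for an embedding model (such as those cited above) and an LLM API; this is a minimal sketch of the pattern, not the implementation of any cited system.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
import numpy as np

def retrieve(query: str, chunks: list[str], embed, k: int = 3) -> list[str]:
    """Rank knowledge-base chunks by cosine similarity to the query embedding."""
    q = np.asarray(embed(query))
    embeddings = [np.asarray(embed(c)) for c in chunks]
    scores = [float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))
              for e in embeddings]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def rag_answer(query: str, chunks: list[str], embed, llm_generate) -> str:
    """Augment the prompt with retrieved context, then generate the answer."""
    context = "\n".join(retrieve(query, chunks, embed, k=3))
    prompt = ("Answer the medical question using ONLY the context below.\n"
              f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    return llm_generate(prompt)
```

Because the knowledge base can be updated in place, new medical guidelines become available to the model without any retraining, which is the key advantage discussed above.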

2.4 Discussion
This section discusses the principles of medical LLMs, including three types of methods for building models: pre-training,
fine-tuning, and prompting. To meet the needs of practical medical applications, users can choose proper medical LLMs
according to the magnitude of their own computing resources. Companies or institutes with massive computing resources
can either pre-train an application-level medical LLM from scratch or fine-tune existing open-source general LLM models
(e.g. LLaMA 43 ) using large-scale medical data. The results in existing literature (e.g. Med-PaLM2 11 , MedAlpaca 16 and
Clinical Camel 18 ) have shown that fine-tuning general LLMs on medical data can boost their performance of medical tasks.
For example, Clinical Camel 18 , which is fine-tuned from the LLaMA-2-70B 5 model, even outperforms GPT-4 18 . However,
small enterprises or individuals with limited computing resources can combine an understanding of medical tasks with a
reasonable mix of ICL, prompt engineering, and RAG to prompt black-box LLMs, which may also achieve impressive
results. For example, MedPrompt 12 steers the commercial LLM GPT-4 7 through an appropriate combination of prompting
strategies to achieve comparable or even better results than fine-tuned medical LLMs (e.g., Med-PaLM2 11 ) and human experts,
suggesting that a mix of prompting strategies can be an efficient and resource-friendly alternative to fine-tuning in the medical domain.

3 Medical Tasks
In this section, we will introduce two popular types of medical machine learning tasks: generative and discriminative tasks,
including ten representative tasks that further build up clinical applications. Figure 3 illustrates the performance comparisons

between different LLMs. For clarity, we only provide a general discussion of these tasks; detailed task definitions and
performance comparisons can be found in our supplementary material.

[Figure 3 body omitted: a radar chart comparing the four systems listed in the caption across the eleven dataset-metric pairs named there.]

Figure 3. Performance (dataset, metric, task) comparison between GPT-3.5-turbo, GPT-4, state-of-the-art task-specific lightweight models (fine-tuned), and human experts on seven medical tasks across eleven datasets: USMLE, PubMedQA, and MedMCQA (question answering, accuracy); NCBI Disease and BC5CDR Drug/Chem. (entity extraction, F1); DDI (relation extraction, F1); MIMIC-III (text classification, F1); MedNLI (natural language inference, F1); BIOSSES (semantic textual similarity, F1); and NFCorpus and TREC-COVID (information retrieval, NDCG@10). All data presented in our figures originate from published, peer-reviewed literature; please refer to the supplementary material for the detailed data.

3.1 Discriminative Tasks


Discriminative tasks categorize or differentiate data into specific classes or categories based on the given input.
They involve making distinctions between different types of data, often to categorize, classify, or extract relevant information
from structured text or unstructured text. The representative tasks include Question Answering, Entity Extraction, Relation
Extraction, Text Classification, Natural Language Inference, Semantic Textual Similarity, and Information Retrieval.
The typical input for discriminative tasks can be medical questions, clinical notes, medical documents, research papers, and
patient EHRs. The output can be labels, categories, extracted entities, relationships, or answers to specific questions, which are
often structured and categorized information derived from the input text. In existing LLMs, the discriminative tasks are widely
studied and used to make predictions and extract information from input text.
For example, based on medical knowledge, medical literature, or patient EHRs, the medical question answering (QA)
task can provide precise answers to clinical questions, such as symptoms, treatment options, and drug interactions. This can
help clinicians make more efficient and accurate diagnoses 10,11,19 . Entity extraction can automatically identify and categorize
critical information (i.e. entities) such as symptoms, medications, diseases, diagnoses, and lab results from patient EHRs, thus
assisting in organizing and managing patient data. The subsequent entity linking task aims to link the identified entities to a
structured knowledge base or a standardized terminology system, e.g., SNOMED CT 127 , UMLS 102 , or ICD codes 128 . This
task is critical in clinical decision support or management systems, for better diagnosis, treatment planning, and patient care.
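As a concrete illustration of the extraction tasks just described (our own example, not drawn from any cited system), entity extraction can be framed as a prompting task with structured output:

```python
# An illustrative zero-shot prompt framing clinical entity extraction as
# structured JSON output; the note and entity schema are made-up examples.
note = "Patient reports dyspnea; started metformin 500 mg for type 2 diabetes."
prompt = (
    "Extract all SYMPTOM, MEDICATION, and DISEASE entities from the "
    "clinical note below and return them as JSON lists.\n"
    f"Note: {note}\nJSON:"
)
# An LLM would be expected to return something like:
# {"SYMPTOM": ["dyspnea"], "MEDICATION": ["metformin 500 mg"],
#  "DISEASE": ["type 2 diabetes"]}
```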
Performance Comparisons Figure 3 shows that some existing general LLMs (e.g. GPT-3.5-turbo and GPT-4 7 ) have achieved
strong performance on existing medical machine learning tasks. This is most noticeable for the QA task where GPT-4 (shown
in the blue line in Figure 3) consistently outperforms existing task-specific fine-tuned models and is even comparable to
human experts (shown in the purple line). The QA datasets of evaluation include MedQA (USMLE) 14 , PubMedQA 129 , and
MedMCQA 130 . To better understand the QA performance of existing medical LLMs, in Figure 4, we further demonstrate the
QA performance of medical LLMs on the MedQA dataset over time in different model development types. It also clearly shows
that current works, e.g., MedPrompt 12 , have successfully proposed several prompting methods to enable LLMs to outperform
human experts.
However, on the non-QA tasks, as shown in Figure 3, the existing general LLMs perform worse than the task-specific
fine-tuned models. For example, on the entity extraction task using the NCBI disease dataset 131 , the state-of-the-art task-specific

[Figure 4 body omitted; the legend distinguishes prompting, pre-training, and fine-tuning approaches, with human (expert) performance shown as a reference.]

Figure 4. The development of medical large language models over time, by model development type, measured by scores on
United States Medical Licensing Examination (USMLE) questions from the MedQA dataset.

fine-tuned model BioBERT 49 achieves an F1 score of 89.36, substantially exceeding the F1 score of 56.73 by GPT-4. We
hypothesize that the reason for the strong QA capability of the current general LLMs is that the QA task is close-ended; i.e. the
correct answer is already provided among multiple candidates. In contrast, most non-QA tasks are open-ended, where the model has
to predict the correct answer from a large pool of possible candidates, or even without any candidates provided.
Overall, the comparison shows that current general LLMs have a strong question-answering capability; however, their
capability on other tasks still needs to be improved. Therefore, we advocate that the evaluation of medical LLMs should be
extended to a broad range of tasks including non-QA tasks, instead of being limited mainly to medical QA tasks. Hereafter, we
will discuss specific clinical applications of LLMs, followed by their challenges and future directions.

3.2 Generative Tasks


Different from discriminative tasks that focus on understanding and categorizing the input text, generative tasks require
a model to accurately generate fluent and appropriate new text based on given inputs. These tasks include medical text
summarization 132,133 , medical text generation 67 , and medical text simplification 134 .
For medical text summarization, the input and output are typically long and detailed medical text (e.g. “Findings” in
radiology reports), and a concise summarized text (e.g., the “Impression” in radiology reports). Such text contains important
medical information that enables clinicians and patients to efficiently capture the key points without going through the entire
text. It can also help medical professionals to draft clinical notes by summarizing patient information or medical histories.
In medical text generation, e.g. discharge instruction generation 135 , the input can be medical conditions, symptoms, patient
demographics, or even a set of medical notes or test results. The output can be a diagnosis recommendation of a medical
condition, personalized instructional information, or health advice for the patient to manage their condition outside the hospital.
Medical text simplification 134 aims to generate a simplified version of the complex medical text by, for example, clarifying
and explaining medical terms. Different from text summarization, which concentrates on producing shortened text while preserving
most of the original meaning, text simplification focuses more on readability. In particular, complicated or opaque
words are replaced; complex syntactic structures are simplified; and rare concepts are explained 38 . Thus, one
example application is to generate easy-to-understand educational materials for patients from complex medical texts. It is
useful for making medical information accessible to a general audience, without altering the essential meaning of the texts.

4 Clinical Applications
As shown in Figure 5, this section discusses the clinical applications of LLMs. Each subsection contains a specific application
and discusses how LLMs perform this task. Table 3 summarizes the guidelines on how to select, build, and evaluate medical
LLMs for various clinical applications.

[Figure 5 body omitted: illustrated examples of the clinical applications discussed in this section, namely medical diagnosis, formatting and ICD coding, clinical report generation, medical robotics, medical language translation, medical education, mental health support, and medical inquiry and response.]
Figure 5. Integrated overview of potential applications 101,136,137,138,139 of large language models in medicine.

4.1 Medical Diagnosis


Medical diagnosis involves the medical practitioner using objective medical data from tests, together with self-described subjective
symptoms, to determine the most likely health problem occurring in the patient 182 . This heavily relies on the synthesis and
interpretation of vast amounts of information from various sources, including patient medical histories, clinical data, and
the latest medical literature. The advent of large language models has opened up new opportunities for enhancing medical
diagnostic processes. These advanced natural language processing models can rapidly process and comprehend massive
volumes of medical data, literature, and legal guidelines, potentially aiding healthcare professionals in making more informed
and legally sound diagnostic decisions 182,19 .
Guideline Dr. Knows 101 extracts a knowledge graph from the Unified Medical Language System (UMLS) 102 to identify
likely diagnoses from patient medical records. It then fine-tunes T5 models using these diagnoses as prompts
and leverages zero-shot prompting for ChatGPT, demonstrating improved diagnosis prediction accuracy by utilizing the
knowledge graph. An alternative is DDx PaLM-2 140 , which builds on Google’s PaLM-2 and is further fine-tuned with medical
datasets for zero-shot prompting. It can engage in conversations and collaborate with clinicians, allowing them to interactively
identify potential diagnoses for complex medical conditions while improving the doctors’ diagnostic reasoning abilities
through the interactive process. Recent advancement in multimodal LLMs allows AI to understand more complex medical
knowledge and data with multimodality 183 . Notable examples include Med-Flamingo 141 , LLaVA-Med 144 , and Med-Gemini 95 .
Med-Flamingo 141 builds on Google’s vision-language model Flamingo and is pre-trained with medical image-text data.
LLaVA-Med 144 , proposed by Microsoft, builds on the LLaVA model and is trained with a two-stage medical concept alignment
and medical instruction tuning using 660k samples. Google’s powerful Med-Gemini 95 builds upon the Gemini foundation
and is further fine-tuned with large-scale medical multimodal data, highlighting its capabilities in multimodal understanding
and long-context processing. In training and evaluating these models, commonly utilized datasets include the MIMIC series,
such as MIMIC-III 56 and MIMIC-IV 89 discharge summaries, as well as question answering datasets such as MultiMedQA 184 ,
which is constructed from various resources covering patient cases from PubMed articles and textbooks. For multimodal data,

Table 3. Summary of existing medical LLMs tailored to various clinical applications, in terms of their architecture, model
development, the number of parameters, the scale of PT/FT data, and the data source. M: million, B: billion. PT: Pre-training.
FT: Fine-tuning. ICL: In-Context Learning. CoT: Chain-of-Thought prompting. RAG: Retrieval-Augmented Generation.
Medical Diagnosis (Sec. 4.1). Columns: Model | Architecture | Development | # Params | Data Scale | Data Source
Dr. Knows 101 | GPT-3.5 | ICL | 154B | 5820 notes | MIMIC-III 56 + IN-HOUSE 101
DDx PaLM-2 140 | PaLM-2 | FT & ICL | 340B | - | MultiMedQA 11 + MIMIC-III 56
Med-Gemini 95 | Gemini | FT & CoT | - | - | MedQA-R/RS 95 + MultiMedQA 11 + MIMIC-III 56 + MultiMedBench 96
Med-Flamingo 141 | ViT+LLaMA-7B | FT | - | 600k pairs | Multimodal Textbook 141 + PMC-OA 141 + VQA-RAD 142 + PathVQA 143
LLaVA-Med 144 | ViT+LLaMA-7B | FT | - | 660k pairs | PMC-15M 144 + VQA-RAD 142 + SLAKE 145 + PathVQA 143

Formatting & ICD-Coding (Sec. 4.2)
PLM-ICD 146 | RoBERTa | FT | 355M | 70,539 notes | MIMIC-II 147 + MIMIC-III 56
DRG-LLaMA 148 | LLaMA-7B | FT | 7B | 25k pairs | MIMIC-IV 149
ChatICD 150 | ChatGPT | ICL | - | 10k pairs | MIMIC-III 56
LLM-codex 151 | ChatGPT+LSTM | ICL | - | - | MIMIC-III 56

Clinical Report Generation (Sec. 4.3)
ImpressionGPT 152 | ChatGPT | ICL & RAG | 110M | 184k reports | MIMIC-CXR 149 + IU X-ray 153
RadAdapt 154 | T5 | FT | 223M, 738M | 80k reports | MIMIC-III 56
ChatCAD 100 | GPT-3 | ICL | 175B | 300 reports | MIMIC-CXR 149
MAIRA-1 155 | ViT+Vicuna-7B | FT | 8B | 337k pairs | MIMIC-CXR 149
RadFM 156 | ViT+LLaMA-13B | PT & FT | 14B | 32M pairs | MedMD 156

Medical Robotics (Sec. 4.4)
SuFIA 157 | GPT-4 | ICL | - | 4 tasks | ORBIT-Surgical 158
UltrasoundGPT 159 | GPT-4 | ICL | - | 522 tasks | -
Robotic X-ray 160 | GPT-4 | ICL | - | - | -

Medical Language Translation (Sec. 4.5)
Medical mT5 161 | T5 | PT | 738M, 3B | 4.5B pairs | PubMed 50 + EMEA 162 + ClinicalTrials 163 , etc.
Apollo 164 | Qwen | PT & FT | 1.8B-7B | 2.5B pairs | ApolloCorpora 164
BiMediX 165 | Mistral | FT | 13B | 1.3M pairs | BiMed1.3M 165
Biomed-sum 166 | BART | FT | 406M | 27k papers | BioCiteDB 166
RALL 167 | BART | FT & RAG | 406M | 63k pairs | CELLS 166

Medical Education (Sec. 4.6)
ChatGPT 168 | GPT-3.5/GPT-4 | ICL | - | - | -
Med-Gemini 95 | Gemini | FT & CoT | - | - | MedQA-R/RS 95 + MultiMedQA 11 + MIMIC-III 56 + MultiMedBench 96

Mental Health Support (Sec. 4.7)
PsyChat 169 | ChatGLM | FT | 6B | 350k pairs | Xingling 169 + Smilechat 169
ChatCounselor 170 | Vicuna | FT | 7B | 8k instructions | Psych8K 170
Mental-LLM 171 | Alpaca, FLAN-T5 | FT & ICL | 7B, 11B | 31k pairs | Dreaddit 172 + DepSeverity 173 + SDCNL 174 + CSSRS-Suicide 175 + Red-Sam 176 + Twt-60Users 177 + SAD 178

Medical Inquiry and Response (Sec. 4.8)
AMIE 179 | PaLM2 | FT | 340B | >2M pairs | MedQA 14 + MultiMedBench 96 + MIMIC-III 56 + real-world dialogue 179
Healthcare Copilot 180 | ChatGPT | ICL | - | - | MedDialog 180
Conversational Diagnosis 181 | GPT-4/LLaMA | ICL | - | 40k pairs | MIMIC-IV 89

VQA-RAD (radiology) 142 , SLAKE (radiology) 145 , and PathVQA (pathology) 143 are frequently employed. Most benchmarking
efforts involve both quantitative evaluation metrics and human evaluations. These models have demonstrated their effectiveness
and the potential for substantial improvements in medical diagnosis tasks.
Discussion One distinct limitation of using LLMs as the sole tool for medical diagnosis is the heavy reliance on subjective
text inputs from the patient. Since LLMs are text-based, they lack the inherent capability to analyze medical diagnostic
imagery. Given that objective medical diagnoses frequently depend on visual images, LLMs are often unable to directly conduct
diagnostic assessments as they lack concrete visual evidence to support disease diagnosis 185 . However, they can help with
diagnosis as a logical reasoning tool for improving accuracy in other vision-based models. One such example is ChatCAD 100 ,
where images are first fed into an existing computer-aided diagnosis (CAD) model to obtain tensor outputs. These outputs
are translated into natural language, which is subsequently fed into ChatCAD to summarize results and formulate diagnoses.
ChatCAD achieves a recall score of 0.781, substantially higher than that (0.382) of the state-of-the-art task-specific model.
Nevertheless, all the aforementioned methods of implementing LLMs cannot directly process images; instead, they either rely
on transforming images into text beforehand or rely on an external separate vision encoder to embed images.
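The following is a minimal sketch of this two-stage vision-then-language pattern, which ChatCAD exemplifies: a CAD model scores the image, the scores are verbalized, and an LLM reasons over the text. Here `cad_model` and `llm_generate` are hypothetical stand-ins, not components of ChatCAD itself.

```python
# Minimal sketch: translate CAD model outputs into text, then let an LLM
# summarize and reason over them. `cad_model` and `llm_generate` are
# hypothetical placeholders for a vision model and an LLM API.
def diagnose(image, cad_model, llm_generate) -> str:
    findings = cad_model(image)  # e.g., {"cardiomegaly": 0.82, "edema": 0.10}
    findings_text = "; ".join(f"{name}: probability {p:.2f}"
                              for name, p in findings.items())
    prompt = (f"A computer-aided diagnosis model reports: {findings_text}.\n"
              "Summarize these findings and state the most likely diagnosis.")
    return llm_generate(prompt)
```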

4.2 Formatting and ICD-Coding
The international classification of diseases (ICD) 128 is a method of standardizing diagnostic and procedural information of a
clinical session. These ICD codes are recorded in the patient’s EHRs at every doctor visit. They are also used for tracking health metrics, treatment outcomes, and billing. There is a need to automate the ICD labeling process because manual entry is very time-consuming for doctors. Formatting and ICD-coding usually involve entity extraction, relation extraction, text
generation, and information retrieval. LLMs can help automate ICD coding by extracting medical terms from clinical notes and
assigning corresponding ICD codes 186,136 .
Guideline For example, PLM-ICD 146 builds upon the RoBERTa model 39 , fine-tuning it specifically for ICD coding and
achieving strong performance on 70,539 notes from the MIMIC-II and MIMIC-III datasets 56 , as evaluated by accuracy. The
base model used in PLM-ICD is domain-specific with medicine-specific knowledge to enhance the ability to understand medical
terms. PLM-ICD uses segment pooling, an algorithm that divides long input texts into shorter segments encoded separately by the language model when the input surpasses the maximum allowable length. Lastly, it relates the segment encodings to the candidate labels to output ICD
codes for each clinical input. PLM-ICD produced a higher AUC score than previous state-of-the-art lightweight models 146 .
DRG-LLaMA 148 leverages the LLaMA model and applies parameter-efficient fine-tuning techniques, such as LoRA, to adapt
the model to this task. ChatICD 150 and LLM-codex 151 both utilize the ChatGPT model with prompts for ICD coding. However,
LLM-codex 151 takes this a step further by training an LSTM model on top of the ChatGPT responses, demonstrating its
strong performance. ICD coding can be formulated as a multi-label classification task, and most work in this area utilizes the
MIMIC-III dataset for training and evaluation. Models are typically assessed based on their F1 score, AUC, and Precision@k,
considering either the top 50 most frequent labels or the full label set.
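To make the segment pooling idea concrete, the following is a minimal PyTorch sketch of chunk-wise encoding with a multi-label classification head. The chunk size, pooling rule, and `num_codes` are illustrative assumptions; PLM-ICD itself combines segment encodings with a label-aware attention mechanism rather than the simple pooling shown here.

```python
# Minimal sketch of segment pooling for ICD coding over long clinical notes.
# Assumptions: roberta-base encoder, mean-then-max pooling, 50 candidate codes.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SegmentPoolingICD(nn.Module):
    def __init__(self, encoder_name="roberta-base", num_codes=50, chunk_size=512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.chunk_size = chunk_size
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_codes)

    def forward(self, input_ids):
        # Split an over-length note into fixed-size chunks and encode each separately.
        chunks = input_ids.split(self.chunk_size, dim=1)
        reps = [self.encoder(c).last_hidden_state.mean(dim=1) for c in chunks]
        doc_rep = torch.stack(reps).max(dim=0).values   # pool chunk representations
        return torch.sigmoid(self.classifier(doc_rep))  # one probability per ICD code

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = SegmentPoolingICD()
ids = tokenizer("Patient admitted with acute chest pain and dyspnea...",
                return_tensors="pt").input_ids
probs = model(ids)  # shape (1, 50); codes with probability > 0.5 would be assigned
```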
Discussion One challenge when deploying LLMs for clinical coding is the potential for biases and hallucinations. In particular, traditional multi-label classification models can easily constrain their outputs to a predefined list of (usually >1000) ICD candidate codes through a classification neural network. In contrast, generative LLMs can suffer from major hallucinations when the input text is lengthy. As a result, the LLM may assign an ICD code that is not in the candidate list, or a non-existent ICD code, to the input text. This leads to confusion when interpreting medical records 23 ; it is therefore crucial to establish a proactive mechanism to detect and rectify errors before they enter patient EHRs.
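As a concrete illustration, a lightweight post-hoc check of this kind can be as simple as filtering generated codes against the candidate list before they reach the EHR; the code set and function below are illustrative assumptions, not a production validator.

```python
# Sketch: reject or flag generated ICD codes that are not in the candidate list.
VALID_ICD10 = {"I21.9", "E11.9", "J18.9"}  # in practice, the full >1000-entry code set

def validate_codes(generated):
    accepted, flagged = [], []
    for code in generated:
        (accepted if code.strip().upper() in VALID_ICD10 else flagged).append(code)
    return accepted, flagged

accepted, flagged = validate_codes(["E11.9", "Z99.999"])  # "Z99.999" does not exist
print(accepted, flagged)  # flagged codes are routed to a human coder for review
```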

4.3 Clinical Report Generation


Clinical reports, such as radiology reports, discharge summaries, and patient clinic letters, refer to standardized documentation
that healthcare workers complete after each patient visit 187 . Therefore, clinical report generation usually involves text
generation/summarization, and information retrieval. A large portion of the report is often medical diagnostic results. It
is typically tedious for overworked clinicians to write clinical reports, and thus reports are often incomplete or error-prone. Meanwhile, LLMs can be used intuitively as a summarization tool to help with clinical report generation. In this instance, LLMs act as an assistant tool for clinicians that helps improve efficiency and reduce potential errors in lengthy reports 20,188,135 .
Another popular approach to generating clinical reports using LLMs involves incorporating a vision-based model to provide
complementary information 100,156,189 . The vision model analyzes the input medical image and generates an annotation, which
serves as a direct and supplementary input to the LLM alongside additional text prompts. By leveraging the combination
of visual and textual information, the LLM produces accurate and fluent reports that adhere to the specified parameters and
structure.
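The sketch below illustrates this vision-to-text-to-LLM pattern; the label set, probabilities, threshold, and prompt wording are illustrative stand-ins, not the actual ChatCAD implementation.

```python
# Sketch: translate a CAD model's tensor output into text, then prompt an LLM.
def describe_findings(probabilities, labels, threshold=0.5):
    """Turn per-disease probabilities from a vision CAD model into a sentence."""
    findings = [f"{label}: {p:.0%} likelihood"
                for label, p in zip(labels, probabilities) if p > threshold]
    return "; ".join(findings) if findings else "no significant findings"

labels = ["pneumonia", "cardiomegaly", "pleural effusion"]
probs = [0.83, 0.12, 0.61]  # would come from the upstream vision model

prompt = ("A computer-aided diagnosis model reports: "
          + describe_findings(probs, labels)
          + ". Write a concise radiology impression consistent with these findings.")
# `prompt` is then sent to the LLM (e.g., via an API call) to draft the report.
print(prompt)
```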
Guideline For radiology report generation, general medical vision-language models like Med-Gemini 95 , LLaVA-Med 144 , and Med-Flamingo 141 can be utilized, but models trained specifically on radiographs, such as ChatCAD 100 , MAIRA-1 155 and RadFM 156 , have shown superior performance to the general medical ones. Alternatively, language models like
ImpressionGPT 152 leverage textual data for report summarization, employing dynamic prompt generation and iterative
optimization. RadAdapt 154 systematically evaluates various language models and lightweight adaptation methods, achieving optimal performance through pre-training on clinical text and parameter-efficient fine-tuning with LoRA, and investigating the impact of few-shot prompting. In terms of evaluation, most work uses MIMIC-III or MIMIC-IV notes for training and evaluation, as these are the largest publicly available free-text EHR datasets. Common automatic evaluation metrics include lexical methods
such as BLEU 190 , ROUGE 191 , METEOR 192 , semantic-based methods such as BERTScore 193 , and radiology-specific metrics
such as CheXbert similarity 194 , RadGraph 195 , RadCliQ 196 .
Discussion While LLMs have demonstrated the ability to generate clinical reports that are more comprehensive and precise
than those written by human counterparts 133 , they still face challenges in terms of hallucinations and literal interpretation of
inputs, lacking the assumption-based perspective often employed by human doctors. Moreover, LLM-generated reports tend to
be less concise compared to human-written ones. The evaluation of LLMs in this domain is particularly challenging due to the
specialized nature of the content and the generative nature of the task. Current automatic evaluation methods for clinical report
generation primarily focus on lexical metrics, which can lead to biased and inaccurate assessments of the contextual information
present in the reports 197 . For instance, consider two sentences with similar meanings but different wordings: “The patient’s
blood glucose level is within normal limits” and “The patient does not exhibit signs of hyperglycemia”. While both convey the
absence of hyperglycemia, lexical evaluation metrics may struggle to accurately capture their semantic equivalence, as they
rely on direct word-level comparisons. This discrepancy highlights the need for more sophisticated evaluation techniques that
can account for the nuances and variations in expressing clinical information. Developing evaluation methods that go beyond
surface-level similarities and consider the underlying medical context is crucial for ensuring the reliability and usefulness of
LLMs in generating clinical reports.
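The limitation can be seen with a few lines of unigram-overlap arithmetic, a simplified stand-in for lexical metrics such as ROUGE-1, applied to the two example sentences above:

```python
# Unigram-overlap F1 between the two semantically equivalent sentences.
a = "the patient's blood glucose level is within normal limits".split()
b = "the patient does not exhibit signs of hyperglycemia".split()
overlap = set(a) & set(b)                 # only {"the"} overlaps
precision, recall = len(overlap) / len(b), len(overlap) / len(a)
f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
print(round(f1, 2))  # ~0.12, despite the sentences agreeing clinically
```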

4.4 Medical Robotics


Medical robots are revolutionizing healthcare, offering precision in ultrasound scanning, diagnostic analyses, and multi-agent surgical planning 159,160,157 . These intelligent systems enhance the medical workforce, bridge gaps in staffing, and perform tasks that extend human physical limits.
Incorporating LLMs, robots interpret environmental data to navigate and execute complex tasks. The Graph-based Robotic Instruction Decomposer 198 exemplifies this by utilizing natural language for route planning. This method has been shown to outperform GPT-4 by over 25.4% in accuracy when simultaneously predicting the correct action and object, and by 43.6% in accuracy when predicting instruction tasks 198 .
Moreover, robotics in surgery now involve multi-agent planning systems 199 . Such systems involve a symphony of robotic
units working collaboratively, each programmed to perform specific tasks that complement one another. For example, one
robotic arm may be responsible for precise incisions while another manages real-time imaging, providing surgeons with a
dynamic view of the operation. This orchestrated approach enhances surgical accuracy, reduces procedure times, and minimizes
patient recovery periods. In diagnostics, robots leverage the power of data analytics to sift through extensive EHRs, identifying
patterns and anomalies that may elude the human eye 200 .
Using LLMs in medical robotics can also improve human-computer interactions. Robots with improved interactivity may
recognize human emotions and requests through natural language inputs. This makes patient communication with robots less
intimidating and more user-friendly than existing implementations 201,137 .
Guideline SuFIA 157 incorporates the strong reasoning capabilities of LLMs with perception modules to implement high-level
planning and low-level control of a robot for surgical sub-task execution, with the current best results obtained via API calls to
GPT-4 Turbo. While measures are incorporated to improve safety and reliability, deploying autonomous or semi-autonomous
robotic surgical assistants (RSAs) in real-world scenarios still carries potential risks from unexpected circumstances. Another work, UltrasoundGPT 159 , proposes an ultrasound embodied intelligence system that equips ultrasound robots with LLMs and domain knowledge. It designs an ultrasound operation knowledge database for the LLM to enable precise motion planning.
A dynamic scanning strategy based on prompt engineering allows LLMs to adjust motion planning during procedures. Their
system improves ultrasound scan efficiency and quality from verbal commands, contributing to non-invasive diagnostics and
streamlined workflows. Another work 160 attempts to interpret domain-specific language in X-ray-guided surgery. It consists of
a minimal protocol enabling an LLM, i.e., GPT-4, to control a robotic X-ray system, namely the Brainlab Loop-X device.
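A minimal sketch of such a protocol-constrained control loop appears below; the action vocabulary, JSON schema, and safety limits are illustrative assumptions rather than the actual SuFIA or Loop-X interface.

```python
# Sketch: the LLM emits structured commands; a validator checks them before execution.
import json

ALLOWED_ACTIONS = {"move_to", "rotate_gantry", "capture_image", "stop"}

def execute_if_safe(llm_output: str):
    cmd = json.loads(llm_output)  # the LLM is prompted to reply in strict JSON
    if cmd.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"Refusing unknown action: {cmd.get('action')}")
    if cmd.get("action") == "rotate_gantry" and abs(cmd.get("degrees", 0)) > 30:
        raise ValueError("Rotation exceeds the safety envelope")
    return cmd  # handed off to the motion controller only after these checks

print(execute_if_safe('{"action": "rotate_gantry", "degrees": 15}'))
```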
Discussion Some challenges with implementing medical robotics are similar to those with collaborative robots (cobots),
which are designed to work alongside and interact with human workers in shared spaces 202 . Integrating LLMs into medical
robotics algorithms for route planning and motion control poses a critical challenge due to the risk of errors and biases inherent
in LLMs. The complex and dynamic nature of shared human-robot workspaces may lead to LLM-powered cobots misjudging
human intentions or making inappropriate decisions, posing safety risks. In contrast, traditional industrial robots, designed to
operate at high speeds in controlled, human-free environments, could better withstand and contain errors or malfunctions due to
their rigid and powerful designs, effectively mitigating human injury risks. Future research opportunities could explore safety
features for cobots, such as sophisticated sensing technologies and physical design constraints, which aim to minimize the
occurrence and consequences of judgment errors related to LLMs in shared human-robot environments 203,204,205 .

4.5 Medical Language Translation


There are two main areas of medical language translation: the translation of medical terminology from one language to another 206 , and the translation of medical dialogue for ease of interpretation by non-professional personnel 207 . Both areas
are important for seamless communication between different groups. The language barrier to global collaboration in both
research and medical techniques can be largely reduced with the help of LLMs. Language translation also improves accuracy in
education resources and research articles, making knowledge accessible worldwide 208 .
Guideline Medical mT5 161 , Apollo 164 , and BiMediX 165 are multilingual large language models in the medical domain.
Medical mT5 161 , which is based on multilingual T5 (mT5) with 738 million / 3 billion parameters, is trained on 4.5 billion
tokens spanning various languages, i.e., English, French, Italian, and Spanish. Apollo 164 supports English,
Chinese, French, Spanish, Arabic, and Hindi based on the Qwen model at various relatively small sizes (i.e., 0.5B, 1.8B,
2B, 6B, and 7B), achieving the best performance among models of equivalent size. BiMediX 165 is a bilingual medical
mixture-of-experts language model for English and Arabic, proposing a semi-automated English-to-Arabic translation pipeline
with human refinement for high-quality translations. For medical translation to lay language, one work aims to enhance the performance of language models in biomedical abstractive summarization by aggregating knowledge from external papers cited
within the source article 166 . It proposes a novel attention-based citation aggregation model that integrates domain-specific
knowledge from citation papers, allowing neural networks to generate summaries by leveraging both the paper content and
relevant knowledge from citation papers 166 . Another work introduces Retrieval-Augmented Lay Language (RALL) generation
with a large and broad-ranging dataset of 63k lay language generation pairs from 12 journals, intuitively fitting the need for external knowledge beyond expert-authored source documents 167 . It also evaluates the ability of both the open-source LLaMA-2 and the
closed-source GPT-4 in background explanation, with and without retrieval augmentation.
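The following sketch shows the retrieval-augmented prompting pattern behind such systems; the keyword retriever and corpus are toy stand-ins for the dense retrievers and external knowledge sources used in practice.

```python
# Sketch: prepend retrieved background snippets to a lay-language generation prompt.
def retrieve_background(term, corpus, k=2):
    return [doc for doc in corpus if term.lower() in doc.lower()][:k]

corpus = [
    "Hypertension means blood pressure that stays higher than normal.",
    "Statins are medicines that help lower cholesterol.",
]
abstract = ("Patients with hypertension showed reduced cardiovascular "
            "events under statin therapy.")
background = " ".join(retrieve_background("hypertension", corpus))
prompt = (f"Background for a general audience: {background}\n"
          f"Rewrite this abstract in plain language: {abstract}")
# `prompt` is then passed to an open- or closed-source LLM for generation.
print(prompt)
```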
Discussion In both translation and simplification tasks, misinterpretation is a common occurrence that can have damaging
consequences. In developing and deploying medical translation and simplification platforms, developers should prioritize
professional datasets, such as textbooks and peer-reviewed journals for medical knowledge recall. This way, it will be less likely
for misinformation from unreliable web sources to skew the output 209 . Another ethical consideration of using LLMs to perform
medical translation is the potential for discriminatory verbiage to be inserted inadvertently into the output. Such verbiage is
difficult to prevent due to the nature of the pipeline. This may cause miscommunications and even have legal consequences 207 .

4.6 Medical Education


LLMs can be incorporated into the medical education system in different ways, including facilitating study through explanations,
aiding in language translation, answering questions, assisting with medical exam preparation, and providing Socratic-style
tutoring 201,138 . Therefore, medical education could involve text generation, text simplification, semantic textual similarity,
information retrieval, etc. It has been suggested that medical education can be augmented by generating scenarios,
problems, and corresponding answers by an LLM. Students will gain a richer educational experience through personalized
study modules and case-based assessments, encountering a wider array of challenges and scenarios beyond those found in
standard textbooks 207 . LLMs can also generate feedback on student responses to practical problems, allowing students to know
their areas of weakness in real time. Such broader exposure to diverse scenarios will inherently better prepare medical students for real-world practice 208 .
Another use of LLMs in the medical field is educating the public. Medical dialogues are often complex and difficult to
understand for the average patient. LLMs can tune the textual output of prompts to use varying degrees of medical terminology
for different audiences. This will make medical information easy to understand for the average person while ensuring medical
professionals have access to the most precise information 207 .
Guideline Large language models such as ChatGPT 168 , and Med-Gemini 95 are increasingly demonstrating their potential
in medical education applications. These powerful models, trained on vast amounts of medical data, offer capabilities in
knowledge synthesis, question answering, and content generation that can augment traditional teaching methods. For instance,
ChatGPT 168 can provide explanations and clarifications on complex medical concepts, facilitating self-study and reinforcing
understanding. Med-Gemini 95 , a multimodal model, can analyze medical images and generate detailed reports, aiding in the
training of diagnostic skills. Institutions are exploring the integration of these language models into curricula, leveraging their
strengths while ensuring proper oversight and ethical considerations. As this technology continues to advance, it holds promise
for enhancing the efficiency and accessibility of medical education while complementing human expertise.
Discussion Potential downsides of using LLMs in medical education include the current lack of ethical training and biases in
training datasets 24 . These biases, if not addressed, can propagate through the generated outputs, reinforcing stereotypes and
potentially leading to discrimination in medical education. The lack of explicit ethical training during LLM development may
also result in the generation of content that does not align with the ethical principles and guidelines of the medical profession,
such as promoting unethical practices or violating patient privacy.
Furthermore, the risk of misinformation, particularly in the form of hallucinations, presents a challenge in utilizing LLMs
for medical education. LLMs can generate plausible-sounding but factually incorrect information, which can mislead students
and healthcare professionals if relied upon without proper verification. This can lead to the propagation of misconceptions,
inappropriate treatment strategies, or misdiagnosis 210 . To mitigate these risks, it is essential to establish rigorous fact-checking
and validation processes and emphasize the importance of critical thinking, evidence-based practice, and the verification of
information from multiple reliable sources in medical education.
4.7 Mental Health Support
Mental health support involves both diagnosis and treatment. For example, depression is treated through a variety of
psychotherapies, including cognitive behavior therapy, interpersonal psychotherapy, psychodynamic therapy, etc. 139 . Many of
these techniques are primarily dominated by patient-doctor conversations, with lengthy treatment plans that are cost-prohibitive
for many. The ability of LLMs to serve as conversation partners and companions may lower the barrier to entry for patients
with financial or physical constraints 211 , increasing the accessibility to mental health treatments 170 . There have been various
research works and discussions on the effects of incorporating LLMs into the treatment plan 170,212,213 .
The level of patient self-disclosure has a heavy impact on the effectiveness of mental health diagnosis and treatment: the degree of willingness to share directly affects diagnostic results and treatment plans. Studies have shown that patient willingness
to discuss mental health-related topics with a robot is high 214,212 . Alongside the convenience and lower financial stakes, mental
health support by LLMs has the potential to be more effective than human counterparts in many scenarios.
Guideline Development and deployment of LLMs targeted at mental health support can start with an existing LLM. Instead
of pre-training or fine-tuning on general medical data, it is often better to use medical question and answer data as most of
the LLM’s work will be talking to the patient, which involves back-and-forth conversation in the format of question and
answering 215 . PsyChat 169 is a client-centric LLM dialogue system that provides psychological support comprising five
modules: client behavior recognition, counselor strategy selection, input packer, response generator, and response selection.
Specifically, the response generator is fine-tuned from ChatGLM-6B on a vast dialogue dataset. Through both automatic
and human evaluations, the system has demonstrated its effectiveness and practicality in real-life mental health support
scenarios. ChatCounselor is designed to provide mental health support. It is initialized from Vicuna and fine-tuned on an 8k-instruction tuning dataset collected from real-world counseling dialogue examples 170 . Psy-LLM is an LLM intended to be an
assistive mental health tool to support the workflow of professional counselors, particularly to support those who might be
suffering from depression or anxiety 215 . Another work presents a comprehensive evaluation of prompt engineering, few-shot,
and fine-tuning techniques on multiple LLMs in the mental health domain 171 . The results reveal that fine-tuning on a variety of
datasets can improve LLM’s capability on multiple mental-health-specific tasks across different datasets simultaneously 171 .
The work also releases their model Mental-Alpaca and Mental-FLAN-T5 as open-source LLMs targeted at multiple mental
health prediction tasks 171 .
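To illustrate the modular structure that systems like PsyChat adopt, the sketch below wires five placeholder functions into one dialogue turn; every module body here is an illustrative stub, not the released system.

```python
# Sketch of a five-module, client-centric dialogue turn (PsyChat-style pipeline).
def recognize_behavior(utterance):            # 1. client behavior recognition
    return "seeking_comfort" if "anxious" in utterance.lower() else "sharing"

def select_strategy(behavior):                # 2. counselor strategy selection
    return {"seeking_comfort": "affirmation_and_reassurance"}.get(behavior, "open_question")

def pack_input(utterance, strategy):          # 3. input packer
    return f"Client: {utterance}\nStrategy: {strategy}\nCounselor:"

def generate_candidates(prompt):              # 4. response generator (a fine-tuned LLM)
    return [f"[LLM response conditioned on: {prompt!r}]"]

def select_response(candidates):              # 5. response selection
    return candidates[0]

turn = "I feel anxious about my exams."
prompt = pack_input(turn, select_strategy(recognize_behavior(turn)))
print(select_response(generate_candidates(prompt)))
```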
Discussion Two of the most critical difficulties in employing LLMs for mental health support are the lack of emotional
understanding and the risk of inappropriate or harmful responses 216 . LLMs, being language models, may struggle to fully
grasp and respond to the complex emotional states and needs of individuals seeking mental health support. They may not be
able to provide the same level of empathy and human connection that is crucial in therapeutic interactions.
Moreover, if not properly trained or controlled, LLMs may generate responses that are inappropriate, insensitive, or even
harmful to individuals in vulnerable emotional states 217 . They may provide advice that is not grounded in evidence-based
psychological practices or that goes against established mental health guidelines. Addressing these challenges requires rigorous
training of LLMs in evidence-based practices, ethical considerations, and risk assessment protocols, as well as collaboration
between mental health professionals and AI researchers.

4.8 Medical Inquiry and Response


The rapid advancement of LLMs also opens up new possibilities for improving healthcare delivery and patient care. LLMs,
trained on vast amounts of medical knowledge, have the potential to understand and generate human-like text, making them
suitable for tasks such as answering patient inquiries and assisting physicians in documentation 180,218 . As the demand for
accessible and efficient healthcare services grows, researchers are exploring the use of medical LLMs to alleviate the burden on
healthcare professionals and provide patients with reliable information and support. Therefore, medical inquiry and response
could involve entity extraction, information retrieval, question answering, and text generation/summarization.
Guideline Several pioneering systems demonstrate the feasibility and effectiveness of this approach. Healthcare Copilot 180
integrates dialogue, memory, and processing components to enable safe patient-LLM interactions, enhance conversations
with historical data, and summarize consultations. Google’s Articulate Medical Intelligence Explore (AMIE) 179 engages in
diagnostic conversations with patients, exhibiting reasoning capabilities comparable to human doctors. AMIE employs a novel
self-play-based simulated environment with automated feedback mechanisms to scale learning across diverse medical contexts.
Another LLM-based diagnostic system 181 enhances planning capabilities by emulating doctors, utilizing reinforcement learning
for disease screening and initial diagnoses, and leveraging LLMs to parse guidelines and conduct differential diagnoses. These
systems showcase the potential of medical LLMs in providing high-quality, AI-powered medical consultations and assisting
physicians in their daily practice.
Discussion However, these systems are still far from ready for real-world clinical use; several challenges must be addressed before widespread deployment in healthcare settings. One major concern is the potential for biased or
inaccurate outputs, which could lead to improper medical advice or misdiagnosis 210 . Rigorous testing and validation across
diverse patient populations and medical contexts are essential to ensure the reliability and generalizability of these systems.
Additionally, the integration of medical LLMs into existing healthcare workflows and infrastructure may require substantial
technical and organizational efforts. Privacy and security concerns surrounding patient data must also be carefully considered
and addressed.
Furthermore, the development and deployment of medical LLMs raise important ethical and responsible AI considerations.
Ensuring transparency, explainability, and accountability in the decision-making processes of these systems is crucial to
maintaining trust and facilitating informed consent from patients 219,220 . The potential impact on the doctor-patient relationship
and the role of human physicians in an AI-assisted healthcare setting must also be carefully examined. Ongoing collaboration
between AI researchers, healthcare professionals, ethicists, and policymakers will be necessary to establish guidelines and best
practices for the responsible development and deployment of medical LLMs in real-world healthcare settings.

5 Challenges
We address the challenges and discuss solutions to the adoption of LLMs in an array of medical applications.

5.1 Hallucination
Hallucination of LLMs refers to the phenomenon where the generated output contains inaccurate or nonfactual information.
It can be categorized into intrinsic and extrinsic hallucinations 221,210 . Intrinsic hallucination generates outputs logically
contradicting factual information, such as wrong calculations of mathematical formulas 210 . Extrinsic hallucination happens
when the generated output cannot be verified, typical examples include LLMs ‘faking’ citations that do not exist or ‘dodging’
the question. When integrating LLMs into the medical domain, fluent but nonfactual LLM hallucinations can lead to the
dissemination of incorrect medical information, causing misdiagnoses, inappropriate treatments, and harmful patient education.
It is therefore vital to ensure the accuracy of LLM outputs in the medical domain.
Potential Solutions Current solutions to mitigate LLM hallucination can be categorized into training-time correction,
generation-time correction, and retrieval-augmented correction. The first (i.e. training-time correction) adjusts model parameter
weights, thus reducing the probability of generating hallucinated outputs. Its examples include factually consistent reinforcement
learning 222 and contrastive learning 223 . The second (i.e. generation-time correction) adds a ‘reasoning’ process to the LLM
inference to ensure reliability, for example by drawing multiple samples 224 or using a confidence score to identify hallucination before the final
generation. The third approach (i.e. retrieval-augmented correction) utilizes external resources to mitigate hallucination, for
example, using factual documents as prompts 225 or chain-of-retrieval prompting technique 226 .
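A minimal sketch of the retrieval-augmented pattern is shown below; the keyword scorer stands in for a real retriever over vetted medical sources, and the abstention instruction is one common way to curb extrinsic hallucination.

```python
# Sketch: ground the answer in retrieved documents and instruct the model to abstain.
def retrieve(query, documents, k=2):
    words = query.lower().split()
    return sorted(documents, key=lambda d: -sum(w in d.lower() for w in words))[:k]

documents = [
    "Metformin is a first-line medication for type 2 diabetes.",
    "ACE inhibitors are commonly used to treat hypertension.",
]
query = "What is a first-line medication for type 2 diabetes?"
context = "\n".join(retrieve(query, documents))
prompt = ("Answer using ONLY the context below; reply 'I don't know' if it is "
          f"not covered.\nContext:\n{context}\nQuestion: {query}")
# Constraining generation to retrieved, verifiable context reduces extrinsic
# hallucinations such as fabricated citations.
print(prompt)
```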

5.2 Lack of Evaluation Benchmarks and Metrics


Current benchmarks and metrics often fail to evaluate LLM’s overall capabilities, especially in the medical domain. For example,
MedQA (USMLE) 14 and MedMCQA 130 offer extensive coverage on QA tasks but fail to evaluate important LLM-specific
metrics, including trustworthiness, helpfulness, explainability, and faithfulness 197 . It is therefore imperative to develop domain
and LLM-specific benchmarks and metrics.
Potential Solutions Singhal et al. 10 proposed HealthSearchQA consisting of commonly searched health queries, offering a
more human-aligned benchmark for evaluating LLM’s capabilities in the medical domain. Benchmarks such as TruthfulQA 227
and HaluEval 228 evaluate more LLM-specific metrics, such as truthfulness, but do not cover the medical domain. Future
research is necessary to meet the need for more medical and LLM-specific benchmarks and metrics than what is currently
available.

5.3 Domain Data Limitations


Current datasets in the medical domain (Table 2) remain relatively small compared to datasets for training general-purpose
LLMs (Table 1). These small datasets cover only a limited portion 10 of the vast domain of medical knowledge. This
results in LLMs exhibiting extraordinary performance on open benchmarks with extensive data coverage, yet falling short on
real-life tasks such as differential diagnosis and personalized treatment planning 11 .
Although the volume of medical and health data is large, most require extensive ethical, legal, and privacy procedures
to be accessed. In addition, these data are often unlabeled, and solutions to leverage these data, such as human labeling and
unsupervised learning 229 , face challenges due to the lack of human expert resources and small margins of error.
Potential Solutions Current state-of-the-art approaches 11,15 typically fine-tune the LLMs on smaller open-sourced datasets
to improve their domain-specific performance. Another solution is to generate high-quality synthetic datasets using LLMs to
broaden the knowledge coverage; however, it has been discovered that training on generated datasets causes models to forget 230 .
Future research is needed to validate the effectiveness of using synthetic data for LLMs in the medical field.
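A minimal sketch of this fine-tune-on-small-data recipe with parameter-efficient LoRA, via the Hugging Face peft library, follows; the checkpoint name and hyperparameters are illustrative, and gated models such as LLaMA additionally require access approval.

```python
# Sketch: wrap a base causal LM with LoRA adapters so only a tiny fraction of
# the weights are trained on the small domain-specific dataset.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
# `model` can now be fine-tuned on a small open-source medical dataset with a
# standard Trainer loop, then the adapters merged or shipped separately.
```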

5.4 New Knowledge Adaptation


LLMs are trained on extensive data to learn knowledge. Once trained, it is expensive and inefficient to inject new knowledge
into an LLM through re-training. However, it is sometimes necessary to update the knowledge of the LLM, for example, on a
new adverse effect of a medication or a novel disease. Two problems occur during such knowledge updates. The first problem
is how to make LLMs appropriately ‘forget’ the old knowledge, as it is almost impossible to remove all ‘old knowledge’ from
the training data, and the discrepancy between new and old knowledge can cause unintended association and bias 231 . The
second problem is the timeliness of the additional knowledge - how do we ensure the model is updated in real-time 232 ? Both
problems pose substantial barriers to using LLMs in medical fields, where accurate and timely updates of medical knowledge
are crucial in real-world implementations.
Potential Solutions Current solutions to knowledge adaptation can be categorized into model editing and retrieval-augmented
generation. Model editing 233 alters the knowledge of the model by modifying its parameters. However, this method does
not generalize well, with their effectiveness varying across different model architectures. In contrast, retrieval-augmented
generation provides external knowledge sources as prompts during model inference; for example, Lewis et al. 234 enabled model
knowledge updates by updating the model’s external knowledge memory.

5.5 Behavior Alignment


Behavior alignment refers to the process of ensuring that the LLM’s behaviors align with the objectives of its task. Development
efforts have been spent on aligning LLMs with general human behavior, but the behavior discrepancy between general humans
and medical professionals remains challenging for adopting LLMs in the medical domain. For example, ChatGPT is well
aligned with general human behavior, but its answers to medical consultations are not as concise and professional as those by
human experts 45 . In addition, misalignment in the medical domain introduces unnecessary harm and ethical concerns 235 that
lead to undesirable consequences.
Potential Solutions Current solutions include instruction fine-tuning, reinforcement learning from human feedback (RLHF) 45 ,
and prompt tuning 118,115 . Instruction fine-tuning 110 refers to improving the performance of LLMs on specific tasks based on
explicit instructions. For example, Ouyang et al. 45 used it to help LLMs generate less toxic and more suitable outputs. RLHF
uses human feedback to evaluate and align the outputs of LLMs. It is effective in multiple tasks, including becoming helpful
chatbots 236 and decision-making agents 237 . Prompt tuning can also align LLMs to the expected output format. For example,
Liu et al. 238 use a prompting strategy, chain of hindsight, to enable the model to detect and correct its errors, thus aligning the
generated output with human expectations.
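For concreteness, the sketch below shows the shape of a single instruction fine-tuning record in the widely used Alpaca-style schema; the medical content is illustrative only.

```python
# Sketch: one (instruction, input, output) record for medical instruction tuning.
record = {
    "instruction": ("Answer the patient's question concisely and professionally, "
                    "and advise seeing a doctor when appropriate."),
    "input": "I've had a mild headache for two days. Should I be worried?",
    "output": ("Most short-lived mild headaches are benign, but seek care if it "
               "worsens, persists, or comes with fever, vision changes, or neck "
               "stiffness."),
}
# Fine-tuning on many such records nudges the model's behavior from casual chat
# toward the concise, professional register expected of clinicians.
```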

5.6 Ethical and Safety Concerns


Concerns have been raised regarding using LLMs (e.g. ChatGPT) in the medical domain 239 , with a focus on ethics, account-
ability, and safety. For example, the scientific community has disapproved of using ChatGPT in writing biomedical research
papers 219 due to ethical concerns. The accountability of using LLMs as assistants to practice medicine is challenging 109,240 . Li
et al. 241 and Shen et al. 220 found that prompt injection can cause the LLM to leak personally identifiable information (PII), e.g.
email addresses, from its training data, which is a substantial vulnerability when implementing LLM in the medical domain.
Potential Solutions With no immediate solutions available, we have nevertheless observed research efforts to understand the
cause of these ethical and legal concerns. For example, Wei et al. 242 propose that PII leakage is attributed to the mismatched
generalization between safety and capability objectives (i.e., the pre-training of LLMs utilizes a larger and more varied dataset
compared to the dataset used for safety training, resulting in many of the model’s capabilities not being covered by safety training).

5.7 Regulatory Challenges


The regulatory landscape of LLMs presents distinct challenges due to their large scale, broad applicability and varying reliability
across applications. As LLMs progressively permeate the fields of medicine and healthcare, their versatility allows a single
LLM family to facilitate a multitude of tasks across a broad spectrum of interest groups. This represents a substantial departure
from the AI-based medical technologies of the past, which were typically tailored to meet specific medical needs and cater to
particular interest groups 243,182 . In addition, the recent innovations of AI-enabled personalized approaches in areas such as
oncology also present challenges to the traditional one-for-all auditing process 244 . This divergence and innovation require regulators to develop adaptable, foresightful frameworks to ensure the safety, ethical standards, and privacy of the new family of LLM-powered medical technologies.

[Figure 6 (schematic). Development: New Benchmarks (comprehensive benchmark development; clinical skill evaluation; ethical and fairness considerations); Multimodal Large Language Models (integration of vision and language; integration of visual, audio, and language; LLMs for time-series data); Medical Agents (multi-agent collaboration in healthcare; specialized role modeling with LLMs; continuous learning through feedback loops). Deployment: LLMs in Underrepresented Specialties (under-representation in specialized fields; potential in sports medicine; role in physical activity education); Interdisciplinary Collaborations (active medical professional involvement; real-world testing and evaluation; assessing and mitigating LLM risks).]
Figure 6. Future directions of LLMs in clinical medicine in terms of both development and deployment.
Potential Solutions To address the complex regulatory challenges without hindering innovation, regulators should devise
adaptive, flexible, and robust frameworks. Drawing on the insights from Mesko and Topol 243 , creating a dedicated regulatory
category and implementing patient design to enhance decision-making for LLMs used for medical purposes can better address
their unique attributes and minimize harm. Furthermore, the insights outlined by Derraz et al. 244 emphasize the importance of
implementing agile regulatory frameworks that can keep pace with the fast-paced advancements in personalized applications.
Researchers both inside 243,244 and outside of healthcare 245,246 have proposed innovative strategies to regulate the use of LLMs
involving (i) assessing LLMs-enabled applications in real-world settings, (ii) obligations of transparency of data and algorithms,
(iii) adaptive risk assessment and mitigation processes, (iv) continuous testing and refinement of audited technologies. Such
proactive regulatory adaptations are crucial to maintaining high standards of safety, ethics, and trustworthiness of medical
technology.

6 Future Directions
Although LLMs have already made an impact on people’s lives through chatbots and search engines, their integration into
medicine is still in its infancy. As shown in Figure 6, numerous new avenues of medical LLMs await researchers and
practitioners to explore how to better serve the general public and patients.

6.1 Introduction of New Benchmarks


Recent studies have underscored the shortcomings of existing benchmarks in evaluating LLMs for clinical applications 247,248 .
Traditional benchmarks, which primarily gauge accuracy in medical question-answering, inadequately capture the full spectrum
of clinical skills necessary for LLMs 10 . Criticisms have been leveled against the use of human-centric standardized medical
exams for LLM evaluation, arguing that passing these tests does not necessarily reflect an LLM’s proficiency in the nuanced
expertise required in real-world clinical settings 10 . In response, there is an emerging consensus on the need for more
comprehensive benchmarks. These should include capabilities like sourcing from authoritative medical references, adapting
to the evolving landscape of medical knowledge, and clearly communicating uncertainties 19,10 . To further enhance the
relevance of these benchmarks, new benchmarks should incorporate scenarios that test an LLM’s ability through simulation of
real-world applications and adjust to feedback from clinicians while maintaining robustness. Additionally, considering the
sensitive nature of healthcare, these benchmarks should also assess factors such as fairness, ethics, and equity, which, though
crucial, pose quantification challenges 10 . While efforts such as the AMIE study have advanced benchmarking by utilizing
real physician evaluations and comprehensive criteria rooted in actual clinical skills and communication, as reflected in the
Objective Structured Clinical Examination (OSCE), there remains a pressing need for benchmarks that are adaptive, scalable
and robust for other diverse and personalized applications of LLMs. The aim is to create benchmarks that more effectively
mirror diverse real-world clinical scenarios, thus providing a more accurate measure of LLMs’ suitability for their applications
in medicine. Future research may focus on (i) using synthetic data along with real-world data to create benchmarks that are
both comprehensive and scalable, (ii) using clinical guidelines and criteria to reflect real-world values that are not normally
included in traditional benchmarks, (iii) physician-in-the-loop benchmarks to evaluate the performance of LLMs leveraging
their human counterparts or users.

6.2 Multimodal LLM Integrated with Time-Series, Visual, and Audio Data
Multimodal LLMs (MLLMs), or Large Multimodal Models (LMMs), are LLM-based models designed to perform multimodal
(e.g. involving both visual and textual) tasks 249 . While LLMs primarily address NLP tasks, MLLMs support a broader range of
tasks, such as comprehending the underlying meaning of a meme and generating website codes from images. This versatility
suggests promising applications of MLLMs in medicine. Several MLLM-based frameworks integrating vision and language, e.g.
MedPaLM M 250 , LLaVA-Med 251 , Visual Med-Alpaca 252 , Med-Flamingo 253 , and Qilin-Med-VL 254 , have been proposed to
adopt the medical image-text pairs for fine-tuning, thus enabling the medical LLMs to efficiently understand the input medical
(e.g. radiology) images. A recent study 255 proposes to integrate vision, audio, and language inputs for automated diagnosis
in dentistry. However, there exist only very few medical LLMs that can process time series data, such as electrocardiograms
(ECGs) 256 and photoplethysmograms (PPGs) 257 , despite such data being important for medical diagnosis and monitoring.
Although early in their proposed research stages, these studies suggest that MLLMs trained at scale have the potential to
effectively generalize across various domains and modalities outside of NLP tasks. However, the training of MLLMs at scale is
still costly and inefficient, resulting in MLLMs being much smaller than LLMs. Moving forward, future research
may focus on (i) more effective processing, representation, and learning of multi-modal data and knowledge, (ii) cost-effective
training of MLLMs, especially modalities that are more resource-demanding such as videos and images, (iii) collecting or
accessing safely, currently unavailable, multi-modal data in medicine and healthcare.

6.3 Medical Agents


LLM-based agents 258,259 utilize LLMs as controllers to leverage their reasoning capabilities. By integrating LLMs with external
tools and multimodal perceptions, these agents can interact with environments, learn from feedback, and acquire new skills,
enabling them to solve complex tasks (e.g., software design, molecular dynamics simulation) through human-like behaviors,
such as role-playing and communication 260,261 .
However, integrating these agents effectively within the medical domain remains a challenge. The medical field involves
numerous roles 261 and decision-making processes, especially in disease diagnosis that often requires a series of investigations
involving CT scans, ultrasounds, electrocardiograms, and blood tests. The idea of utilizing LLMs to model each of these
roles, thereby creating collaborative medical agents, presents a promising direction. These agents could mimic the roles of
radiologists, cardiologists, pathologists, etc., each specializing in interpreting specific types of medical data. For example, a
radiologist agent could analyze CT scans, while a pathologist agent could focus on blood test results. The collaboration among
these specialized agents could lead to a more holistic and accurate diagnosis. By leveraging the comprehensive knowledge
base and contextual understanding capabilities of LLMs, these agents not only interpret individual medical reports but also
integrate these interpretations to form a cohesive medical opinion. To enhance the integration of LLMs-based agents, future
research may explore (i) a seamless data pipeline that collects data from various devices and transforms them into formats compatible with LLMs, (ii) effective communication and collaboration between agents, especially in areas such as ensuring
truthfulness during communication, dispute resolution between agents, and role-based data security measures, (iii) real-time
decision-making such as making timely decisions using data collected from remote monitoring devices, (iv) adaptive learning
such as preparing for a new pandemic or learning from unseen medical conditions.
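The sketch below captures the coordinator-plus-specialists pattern described above at a conceptual level; the role prompts and aggregation step are illustrative assumptions, with each function standing in for an LLM call.

```python
# Sketch: specialist agents each interpret one modality; a coordinator aggregates.
def specialist_agent(role, evidence):
    # In a real system this is an LLM call with a role-playing system prompt.
    return f"As a {role}, my reading of {evidence!r} warrants further review."

opinions = {
    "radiologist": specialist_agent("radiologist", "chest CT scan"),
    "pathologist": specialist_agent("pathologist", "complete blood count"),
    "cardiologist": specialist_agent("cardiologist", "12-lead electrocardiogram"),
}
coordinator_prompt = ("Integrate these specialist opinions into one differential "
                      "diagnosis:\n" +
                      "\n".join(f"- {o}" for o in opinions.values()))
# `coordinator_prompt` is then given to a coordinating LLM agent for synthesis.
print(coordinator_prompt)
```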

6.4 LLMs in Underrepresented Specialties


Current LLM research in medicine has largely focused on general medicine, likely due to the greater availability of data in
this area 11,240 . This has resulted in the under-representation of LLM applications in specialized fields like ‘rehabilitation
therapy’ or ‘sports medicine’. The latter, in particular, holds potential, given the global health challenges posed by physical
inactivity. The World Health Organization identifies physical inactivity as a major risk factor for non-communicable diseases
(NCDs), impacting over a quarter of the global adult population 262 . Despite initiatives to incorporate physical activity (PA) into
healthcare systems, implementation remains challenging, particularly in developing countries with limited PA education among
healthcare providers 262 . LLMs could play a pivotal role in these settings by disseminating accurate PA knowledge and aiding
in the creation of personalized PA programs 263 . Such applications could enhance PA levels, improving global health outcomes,
especially in resource-constrained environments. To spark innovation in these underrepresented specialties, future research can
focus on areas such as (i) effective data collection in underrepresented specialties, (ii) applications of LLMs in assisting with
tasks of underrepresented specialties, (iii) using LLMs to help progress the research of these underrepresented specialties.

6.5 Interdisciplinary Collaborations


Just as interdisciplinary collaborations are crucial in safety-critical areas like nuclear energy production, collaborations between
the medical and technology communities for developing medical LLMs are essential to ensure AI safety and efficacy in
medicine. The medical community has primarily adopted LLMs provided by technology companies without rigorously
questioning their training data, ethical protocols, or privacy protection. Medical professionals are therefore encouraged to
actively participate in creating and deploying medical LLMs by providing relevant training data, defining the desired benefits of
LLMs, and conducting tests in real-world scenarios to evaluate these benefits 19,21,22 . Such assessments would help to determine
the legal and medical risks associated with LLM use in medicine and inform strategies to mitigate LLM hallucination 264 .
Additionally, training ‘bilingual’ professionals—those versed in both medicine and LLM technology—is increasingly vital
due to the rapid integration of LLMs in healthcare. Future research may explore (i) interdisciplinary frameworks, such as
frameworks to facilitate the sharing of localized data from rural clinics, (ii) ‘bilingual education programs’ that offer training
from both worlds - AI and medicine, (iii) effective in-house development methods to help hospitals and physicians ‘guard’
patient data from corporations while still being able to embrace innovation.
References
1. Zhao, W. X. et al. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
2. Yang, J. et al. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712
(2023).
3. Chowdhery, A. et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
4. Touvron, H. et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
5. Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
6. Brown, T. et al. Language models are few-shot learners. Adv. neural information processing systems 33, 1877–1901
(2020).
7. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
8. Du, Z. et al. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics, 320–335 (2022).
9. Zeng, A. et al. Glm-130b: An open bilingual pre-trained model. In International Conference on Learning Representations
(2022).
10. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
11. Singhal, K. et al. Towards expert-level medical question answering with large language models. arXiv preprint
arXiv:2305.09617 (2023).
12. Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv
preprint arXiv:2311.16452 (2023).
13. Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Pmc-llama: Further finetuning llama on medical papers. arXiv preprint
arXiv:2304.14454 (2023).
14. Jin, D. et al. What disease does this patient have? a large-scale open domain question answering dataset from medical
exams. Appl. Sci. 11, 6421 (2021).
15. Li, Y. et al. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical
domain knowledge. arXiv preprint arXiv:2303.14070 (2023).
16. Han, T. et al. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint
arXiv:2304.08247 (2023).
17. Wang, H. et al. Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975 (2023).
18. Toma, A. et al. Clinical camel: An open-source expert-level medical language model with dialogue-based knowledge
encoding. arXiv preprint arXiv:2305.12031 (2023).
19. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. medicine 29, 1930–1940 (2023).
20. Patel, S. B. & Lam, K. Chatgpt: the future of discharge summaries? The Lancet Digit. Heal. 5, e107–e108 (2023).
21. Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: The potentials and
pitfalls. Annals Intern. Medicine (2024).
22. Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Medicine 3, 141 (2023).
23. Yang, X. et al. A large language model for electronic health records. NPJ Digit. Medicine 5, 194 (2022).
24. Abd-Alrazaq, A. et al. Large language models in medical education: Opportunities, challenges, and future directions.
JMIR Med. Educ. 9, e48291 (2023).
25. Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. Adv. neural information processing
systems 13 (2000).
26. Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J. & Khudanpur, S. Recurrent neural network based language model. In
Interspeech, vol. 2, 1045–1048 (2010).
27. Sundermeyer, M., Ney, H. & Schlüter, R. From feedforward to recurrent lstm neural networks for language modeling.
IEEE/ACM Transactions on Audio, Speech, Lang. Process. 23, 517–529 (2015).
28. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
29. Vaswani, A. et al. Attention is all you need. Adv. neural information processing systems 30 (2017).
30. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805 (2018).
31. Kaplan, J. et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
32. Hoffmann, J. et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022).
33. He, P., Liu, X., Gao, J. & Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint
arXiv:2006.03654 (2021).
34. Google. Bard: A generative artificial intelligence chatbot. https://gemini.google.com (2023).
35. Taori, R. et al. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca (2023).
36. Yang, A. et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 (2023).
37. Chung, H. W. et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
38. Joseph, S. et al. Multilingual simplification of medical texts. arXiv preprint arXiv:2305.12532 (2023).
39. Liu, Y. et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
40. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).
41. Chiang, W.-L. et al. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality (2023).
42. Jiang, A. Q. et al. Mistral 7b. arXiv preprint arXiv:2310.06825 (2023).
43. Meta llama 3. https://github.com/meta-llama/llama3 (2024).
44. Bai, J. et al. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
45. Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst.
35, 27730–27744 (2022).
46. Claude. https://www.anthropic.com/claude (2024).
47. Lewis, M. et al. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and
comprehension. arXiv preprint arXiv:1910.13461 (2019).
48. Tay, Y. et al. Ul2: Unifying language learning paradigms. In International Conference on Learning Representations
(2022).
49. Lee, J. et al. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics
36, 1234–1240 (2020).
50. National Institutes of Health. PubMed Corpora (https://pubmed.ncbi.nlm.nih.gov/download/). In National Library of Medicine (2022).
51. PubMed Central (PMC). https://www.ncbi.nlm.nih.gov/pmc/.
52. Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions
on Comput. for Healthc. (HEALTH) 3, 1–23 (2021).
53. Beltagy, I., Lo, K. & Cohan, A. Scibert: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676
(2019).
54. Ammar, W. et al. Construction of the literature graph in semantic scholar. arXiv preprint arXiv:1805.02262 (2018).
55. Alsentzer, E. et al. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323 (2019).
56. Johnson, A. E. et al. Mimic-iii, a freely accessible critical care database. Sci. data 3, 1–9 (2016).
57. Alrowili, S. & Shanker, V. Large biomedical question answering models with albert and electra. In CLEF (Working
Notes), 213–220 (2021).
58. Gururangan, S. et al. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of Association
for Computational Linguistics (ACL) (2020).
59. Lo, K., Wang, L. L., Neumann, M., Kinney, R. & Weld, D. S. S2orc: The semantic scholar open research corpus. arXiv
preprint arXiv:1911.02782 (2019).
60. Yasunaga, M., Leskovec, J. & Liang, P. Linkbert: Pretraining language models with document links. In Proceedings of
Association for Computational Linguistics (ACL) (2022).
61. Phan, L. N. et al. Scifive: a text-to-text transformer model for biomedical literature. arXiv preprint arXiv:2106.03598
(2021).
62. Lu, Q., Dou, D. & Nguyen, T. Clinicalt5: A generative language model for clinical text. In Findings of the Association
for Computational Linguistics: EMNLP 2022, 5436–5443 (2022).
63. Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of bert and elmo
on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, 58–65 (2019).
64. Mutinda, F. W. et al. Detecting redundancy in electronic medical records using clinical bert. In Proceedings of the Annual
Conference of the Association for Natural Language Processing, 16–19 (2020).
65. Mahajan, D. et al. Identification of semantically similar sentences in clinical notes: Iterative intermediate training using
multi-task learning. JMIR medical informatics 8, e22508 (2020).
66. Jin, Q. et al. Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical
information retrieval. arXiv preprint arXiv:2307.00589 (2023).
67. Luo, R. et al. Biogpt: generative pre-trained transformer for biomedical text generation and mining. Briefings Bioinforma.
23, bbac409 (2022).
68. Venigalla, A., Frankle, J. & Carbin, M. Biomedlm: a domain-specific large language model for biomedical text. MosaicML.
Accessed: Dec 23, 2022.
69. Gao, L. et al. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020).
70. Gao, W. et al. Ophglm: Training an ophthalmology large language-and-vision assistant based on instructions and dialogue.
arXiv preprint arXiv:2306.12174 (2023).
71. Chen, S. et al. Meddialog: a large-scale medical dialogue dataset. arXiv preprint arXiv:2004.03329 3 (2020).
72. Peng, C. et al. A study of generative large language model for medical research and healthcare. arXiv preprint
arXiv:2305.13523 (2023).
73. Xiong, H. et al. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv preprint arXiv:2304.01097
(2023).
74. Toyhom. Chinese medical dialogue data. https://github.com/Toyhom/Chinese-medical-dialogue-data (2023). GitHub repository.
75. Chen, Y. et al. Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health
conversations polished by chatgpt. arXiv preprint arXiv:2310.15896 (2023).
76. Wang, G., Yang, G., Du, Z., Fan, L. & Li, X. Clinicalgpt: Large language models finetuned with diverse medical data and
comprehensive evaluation. arXiv preprint arXiv:2306.09968 (2023).
77. Ye, Q. et al. Qilin-med: Multi-stage knowledge injection advanced medical large language model. arXiv preprint
arXiv:2310.09089 (2023).
78. Healthcaremagic. https://www.healthcaremagic.com.
79. iCliniq. https://www.icliniq.com/.
80. Byambasuren, O. et al. Preliminary study on the construction of chinese medical knowledge graph. J. Chin. Inf. Process.
33, 1–9 (2019).
81. Zhang, H. et al. Huatuogpt, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075 (2023).
82. Xu, C., Guo, D., Duan, N. & McAuley, J. Baize: An open-source chat model with parameter-efficient tuning on self-chat
data. arXiv preprint arXiv:2304.01196 (2023).
83. Abacha, A. B. & Demner-Fushman, D. A question-entailment approach to question answering. BMC Bioinforma. 20
(2019).
84. Luo, Y. et al. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint
arXiv:2308.09442 (2023).
85. Zhang, X. et al. Alpacare: Instruction-tuned large language models for medical application. arXiv preprint
arXiv:2310.14558 (2023).
86. Yang, S. et al. Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback
and real-world multi-turn dialogue. arXiv preprint arXiv:2308.03549 (2023).
87. Shoham, O. B. & Rappoport, N. Cpllm: Clinical prediction with large language models. arXiv preprint arXiv:2309.11295
(2023).
88. Pollard, T. J. et al. The eicu collaborative research database, a freely available multi-center database for critical care
research. Sci. data 5, 1–13 (2018).
89. Johnson, A. et al. Mimic-iv. https://physionet.org/content/mimiciv/1.0/ (2020).
90. Pal, A. & Sankarasubbu, M. Openbiollms: Advancing open-source large language models for healthcare and life sciences.
https://huggingface.co/aaditya/OpenBioLLM-Llama3-70B (2024).
91. Chen, Z. et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079
(2023).
92. Bosselut, A. et al. Meditron: Open medical foundation models adapted for clinical practice. Preprint (2024).
93. Sharegpt: Share your wildest chatgpt conversations with one click. https://sharegpt.com (2023).
94. Yang, L. et al. Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162 (2024).
95. Saab, K. et al. Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416 (2024).
96. Tanno, R. et al. Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report
generation. arXiv preprint arXiv:2311.18260 (2023).
97. Liévin, V., Hother, C. E. & Winther, O. Can large language models reason about medical questions? arXiv preprint
arXiv:2207.08143 (2022).
98. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35,
24824–24837 (2022).
99. Liu, Z. et al. Deid-gpt: Zero-shot medical text de-identification by gpt-4. arXiv preprint arXiv:2303.11032 (2023).
100. Wang, S., Zhao, Z., Ouyang, X., Wang, Q. & Shen, D. Chatcad: Interactive computer-aided diagnosis on medical image
using large language models. arXiv preprint arXiv:2302.07257 (2023).
101. Gao, Y. et al. Leveraging a medical knowledge graph into large language models for diagnosis prediction. arXiv e-prints
arXiv–2308 (2023).
102. Bodenreider, O. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research
32, D267–D270 (2004).
103. Shi, W. et al. Retrieval-augmented large language models for adolescent idiopathic scoliosis patients in shared decision-
making. In Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and
Health Informatics, 1–10 (2023).
104. SRS. https://www.srs.org. Accessed: 2024-05-14.
105. UpToDate. http://uptodate.com. Accessed: 2024-05-14.
106. Dynamed. https://www.dynamed.com. Accessed: 2024-05-14.
107. Kim, J. & Min, M. From rag to qa-rag: Integrating generative ai for pharmaceutical regulatory compliance process. arXiv
preprint arXiv:2402.01717 (2024).
108. Zakka, C. et al. Almanac—retrieval-augmented language models for clinical medicine. NEJM AI 1, AIoa2300068 (2024).
109. He, K. et al. A survey of large language models for healthcare: from data, technology, and applications to accountability
and ethics. arXiv preprint arXiv:2310.05694 (2023).
110. Zhang, S. et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792 (2023).
111. Wang, H., Liu, C., Zhao, S., Qin, B. & Liu, T. Chatglm-med. https://github.com/SCIR-HI/Med-ChatGLM (2023).
112. Hu, E. J. et al. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
113. Li, X. L. & Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190
(2021).
114. Liu, X. et al. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics, 61–68 (2022).
115. Liu, X. et al. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv
preprint arXiv:2110.07602 (2021).
116. Houlsby, N. et al. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning,
2790–2799 (2019).
117. Dong, Q. et al. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022).
118. Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the
2021 Conference on Empirical Methods in Natural Language Processing, 3045–3059 (2021).
119. Gao, Y. et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997
(2023).
120. Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. arXiv preprint
arXiv:2402.13178 (2024).
121. Li, X. & Li, J. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871 (2023).
122. Wang, G. et al. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291
(2023).
123. Chen, J. et al. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through
self-knowledge distillation. arXiv preprint arXiv:2309.07597 (2023).
124. Shao, Z. et al. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv
preprint arXiv:2305.15294 (2023).
125. Trivedi, H., Balasubramanian, N., Khot, T. & Sabharwal, A. Interleaving retrieval with chain-of-thought reasoning for
knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509 (2022).
126. Asai, A., Wu, Z., Wang, Y., Sil, A. & Hajishirzi, H. Self-rag: Learning to retrieve, generate, and critique through
self-reflection. arXiv preprint arXiv:2310.11511 (2023).
127. Donnelly, K. Snomed-ct: The advanced terminology and coding system for ehealth. Stud. Health Technol. Informatics
121, 279 (2006).
128. World Health Organization. International classification of diseases: ninth revision, basic tabulation list with alphabetic
index (World Health Organization, 1978).
129. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. Pubmedqa: A dataset for biomedical research question answering.
arXiv preprint arXiv:1909.06146 (2019).
130. Pal, A., Umapathi, L. K. & Sankarasubbu, M. Medmcqa: A large-scale multi-subject multi-choice dataset for medical
domain question answering. In Conference on Health, Inference, and Learning, 248–260 (2022).
131. Doğan, R. I., Leaman, R. & Lu, Z. Ncbi disease corpus: a resource for disease name recognition and concept normalization.
J. biomedical informatics 47, 1–10 (2014).
132. Tang, L. et al. Evaluating large language models on medical evidence summarization. npj Digit. Medicine 6, 158 (2023).
133. Van Veen, D. et al. Clinical text summarization: Adapting large language models can outperform human experts. arXiv
preprint arXiv:2309.07430 (2023).
134. Ondov, B., Attal, K. & Demner-Fushman, D. A survey of automated methods for biomedical text simplification. J. Am.
Med. Informatics Assoc. 29, 1976–1988 (2022).
135. Liu, F. et al. Retrieve, reason, and refine: Generating accurate and faithful patient instructions. Adv. Neural Inf. Process.
Syst. 35, 18864–18877 (2022).
136. Dong, H. et al. Automated clinical coding: what, why, and where we are? NPJ digital medicine 5, 159 (2022).
137. D’Onofrio, G. et al. Emotion recognizing by a robotic solution initiative. Sensors 22, 2861 (2022).
138. Biri, S. K. et al. Assessing the utilization of large language models in medical education: Insights from undergraduate
medical students. Cureus 15 (2023).
139. Vaidyam, A. N., Wisniewski, H., Halamka, J. D., Kashavan, M. S. & Torous, J. B. Chatbots and conversational agents in
mental health: a review of the psychiatric landscape. The Can. J. Psychiatry 64, 456–464 (2019).
140. McDuff, D. et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164
(2023).
141. Moor, M. et al. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), 353–367
(2023).
142. Lau, J. J., Gayen, S., Ben Abacha, A. & Demner-Fushman, D. A dataset of clinically generated visual questions and
answers about radiology images. Sci. data 5, 1–10 (2018).
143. He, X., Zhang, Y., Mou, L., Xing, E. & Xie, P. Pathvqa: 30000+ questions for medical visual question answering. arXiv
preprint arXiv:2003.10286 (2020).
144. Li, C. et al. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Adv. Neural Inf.
Process. Syst. 36 (2024).
145. Liu, B. et al. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021
IEEE 18th International Symposium on Biomedical Imaging (ISBI), 1650–1654 (IEEE, 2021).
146. Huang, C.-W., Tsai, S.-C. & Chen, Y.-N. Plm-icd: Automatic icd coding with pretrained language models. arXiv e-prints
arXiv–2207 (2022).
147. Saeed, M., Lieu, C., Raber, G. & Mark, R. G. Mimic ii: a massive temporal icu patient database to support research in
intelligent patient monitoring. In Computers in cardiology, 641–644 (IEEE, 2002).
148. Wang, H., Gao, C., Dantona, C., Hull, B. & Sun, J. Drg-llama: tuning llama model to predict diagnosis-related group for
hospitalized patients. npj Digit. Medicine 7, 16 (2024).
149. Johnson, A. E. et al. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.
Sci. data 6, 317 (2019).
150. Liu, J., Yang, S., Peng, T., Hu, X. & Zhu, Q. Chaticd: Prompt learning for few-shot icd coding through chatgpt. In 2023
IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 4360–4367 (2023).
151. Yang, Z., Batra, S. S., Stremmel, J. & Halperin, E. Surpassing gpt-4 medical coding with a two-stage approach. arXiv
preprint arXiv:2311.13735 (2023).
152. Ma, C. et al. An iterative optimizing framework for radiology report summarization with chatgpt. IEEE Transactions on
Artif. Intell. (2024).
153. Open-i. https://openi.nlm.nih.gov/. Accessed: 2024-05-14.
154. Van Veen, D. et al. Radadapt: Radiology report summarization via lightweight domain adaptation of large language
models. arXiv preprint arXiv:2305.01146 (2023).
155. Hyland, S. L. et al. Maira-1: A specialised large multimodal model for radiology report generation. arXiv preprint
arXiv:2311.13668 (2023).
156. Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Towards generalist foundation model for radiology. arXiv preprint
arXiv:2308.02463 (2023).
157. Moghani, M. et al. Sufia: Language-guided augmented dexterity for robotic surgical assistants. arXiv preprint
arXiv:2405.05226 (2024).
158. Yu, Q. et al. Orbit-surgical: An open-simulation framework for learning surgical augmented dexterity. arXiv preprint
arXiv:2404.16027 (2024).
159. Xu, H. et al. Enhancing surgical robots with embodied intelligence for autonomous ultrasound scanning. arXiv preprint
arXiv:2405.00461 (2024).
160. Killeen, B. D., Chaudhary, S., Osgood, G. & Unberath, M. Take a shot! natural language control of intelligent robotic
x-ray systems in surgery. Int. J. Comput. Assist. Radiol. Surg. 1–9 (2024).
161. García-Ferrero, I. et al. Medical mt5: an open-source multilingual text-to-text llm for the medical domain. arXiv preprint
arXiv:2404.07613 (2024).
162. Tiedemann, J. Parallel data, tools and interfaces in opus. In Proceedings of LREC 2012, 2214–2218 (2012).
163. National Library of Medicine. Clinical trials. https://clinicaltrials.gov/ (2022). Accessed: 2024-05-14.
164. Wang, X. et al. Apollo: Lightweight multilingual medical llms towards democratizing medical ai to 6b people. arXiv
preprint arXiv:2403.03640 (2024).
165. Pieri, S. et al. Bimedix: Bilingual medical mixture of experts llm. arXiv preprint arXiv:2402.13253 (2024).
166. Tang, C., Wang, S., Goldsack, T. & Lin, C. Improving biomedical abstractive summarisation with knowledge aggregation
from citation papers. arXiv preprint arXiv:2310.15684 (2023).
167. Guo, Y., Qiu, W., Leroy, G., Wang, S. & Cohen, T. Retrieval augmentation of large language models for lay language
generation. J. Biomed. Informatics 149, 104580 (2024).
168. OpenAI. Chatgpt [large language model]. https://chat.openai.com (2023).
169. Qiu, H., Li, A., Ma, L. & Lan, Z. Psychat: A client-centric dialogue system for mental health support. arXiv preprint
arXiv:2312.04262 (2023).
170. Liu, J. M. et al. Chatcounselor: A large language models for mental health support. arXiv preprint arXiv:2309.15461
(2023).
171. Xu, X. et al. Mental-llm: Leveraging large language models for mental health prediction via online text data. Proc. ACM
on Interactive, Mobile, Wearable Ubiquitous Technol. 8, 1–32 (2024).
172. Turcan, E. & McKeown, K. Dreaddit: A reddit dataset for stress analysis in social media. arXiv preprint arXiv:1911.00133
(2019).
173. Naseem, U., Dunn, A. G., Kim, J. & Khushi, M. Early identification of depression severity levels on reddit using ordinal
classification. In Proceedings of the ACM Web Conference 2022, 2563–2572 (2022).
174. Haque, A., Reddi, V. & Giallanza, T. Deep learning for suicide and depression identification with unsupervised label
correction. In International Conference on Artificial Neural Networks, 436–447 (2021).
175. Gaur, M. et al. Knowledge-aware assessment of severity of suicide risk for early intervention. In The world wide web
conference, 514–525 (2019).
176. Sampath, K. & Durairaj, T. Data set creation and empirical analysis for detecting signs of depression from social media
postings. In International Conference on Computational Intelligence in Data Science, 136–151 (2022).
177. Jamil, Z. Monitoring tweets for depression to detect at-risk users. Ph.D. thesis, Université d’Ottawa/University of Ottawa
(2017).
178. Mauriello, M. L. et al. Sad: A stress annotated dataset for recognizing everyday stressors in sms-like conversational
systems. In Extended abstracts of the 2021 CHI conference on human factors in computing systems, 1–7 (2021).
179. Tu, T. et al. Towards conversational diagnostic ai. arXiv preprint arXiv:2401.05654 (2024).
180. Ren, Z., Zhan, Y., Yu, B., Ding, L. & Tao, D. Healthcare copilot: Eliciting the power of general llms for medical
consultation. arXiv preprint arXiv:2402.13408 (2024).
181. Sun, Z., Luo, C. & Huang, Z. Conversational disease diagnosis via external planner-controlled large language models.
arXiv preprint arXiv:2404.04292 (2024).
182. Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. Ai in health and medicine. Nat. medicine 28, 31–38 (2022).
183. Zhao, Z. et al. Clip in medical imaging: A comprehensive survey. arXiv preprint arXiv:2312.07353 (2023).
184. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
185. Liu, F., Wu, X., Ge, S., Fan, W. & Zou, Y. Exploring and distilling posterior and prior knowledge for radiology report
generation. In IEEE Conference on Computer Vision and Pattern Recognition (2021).
186. Ong, J. et al. Applying large language model artificial intelligence for retina international classification of diseases (icd)
coding. J. Med. Artif. Intell. 6 (2023).
187. Liu, X. et al. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nat.
Medicine 25, 1467–1469 (2019).
188. Ali, S. R., Dobbs, T. D., Hutchings, H. A. & Whitaker, I. S. Using chatgpt to write patient clinic letters. The Lancet Digit.
Heal. 5, e179–e181 (2023).
189. Wu, C. et al. Can gpt-4v (ision) serve medical applications? case studies on gpt-4v for multimodal medical diagnosis.
arXiv preprint arXiv:2310.09909 (2023).
190. Papineni, K., Roukos, S., Ward, T. & Zhu, W. BLEU: a Method for automatic evaluation of machine translation. In
Proceedings of Association for Computational Linguistics (ACL) (2002).
191. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Association for Computational
Linguistics (ACL) (2004).
192. Banerjee, S. & Lavie, A. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments.
In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or
summarization, 65–72 (2005).
193. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv
preprint arXiv:1904.09675 (2019).
194. Smit, A. et al. Chexbert: combining automatic labelers and expert annotations for accurate radiology report labeling
using bert. arXiv preprint arXiv:2004.09167 (2020).
195. Jain, S. et al. Radgraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463
(2021).
196. Yu, F. et al. Evaluating progress in automatic chest x-ray radiology report generation. Patterns 4 (2023).
197. Xie, Q. et al. Faithful ai in medicine: A systematic review with large language models and beyond. medRxiv (2023).
198. Ni, Z. et al. Grid: Scene-graph-based instruction-driven robotic task planning. arXiv preprint arXiv:2309.07726 (2023).
199. Wang, J. et al. Large language models for robotics: Opportunities, challenges, and perspectives. arXiv preprint
arXiv:2401.04334 (2024).
200. Pee, L. G., Pan, S. L. & Cui, L. Artificial intelligence in healthcare robots: A social informatics study of knowledge
embodiment. J. Assoc. for Inf. Sci. Technol. 70, 351–369 (2019).
201. Qiu, J. et al. Large ai models in health informatics: Applications, challenges, and the future. IEEE J. Biomed. Heal.
Informatics (2023).
202. Emaminejad, N., Akhavian, R. et al. Trust in construction ai-powered collaborative robots: A qualitative empirical
analysis. arXiv preprint arXiv:2308.14846 (2023).
203. Weerarathna, I. N., Raymond, D. & Luharia, A. Human-robot collaboration for healthcare: A narrative review. Cureus 15
(2023).
204. Moglia, A., Georgiou, K., Georgiou, E., Satava, R. M. & Cuschieri, A. A systematic review on artificial intelligence in
robot-assisted surgery. Int. J. Surg. 95, 106151 (2021).
205. Xia, Y., Wang, S. & Kan, Z. A nested u-structure for instrument segmentation in robotic surgery. In International
Conference on Advanced Robotics and Mechatronics (ICARM), 994–999 (2023).
206. Noll, R., Frischen, L. S., Boeker, M., Storf, H. & Schaaf, J. Machine translation of standardised medical terminology
using natural language processing: A scoping review. New Biotechnol. (2023).
207. Karabacak, M. et al. The advent of generative language models in medical education. JMIR Med. Educ. 9, e48163 (2023).
208. Ahn, S. The impending impacts of large language models on medical education. Korean J. Med. Educ. 35, 103 (2023).
209. Chen, Y., Arunasalam, A. & Celik, Z. B. Can large language models provide security & privacy advice? measuring the
ability of llms to refute misconceptions. In Proceedings of the 39th Annual Computer Security Applications Conference,
366–378 (2023).
210. Rawte, V., Sheth, A. & Das, A. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922
(2023).
211. Stock, A., Schlögl, S. & Groth, A. Tell me, what are you most afraid of? exploring the effects of agent representation on
information disclosure in human-chatbot interaction. arXiv e-prints arXiv–2307 (2023).
212. De Choudhury, M., Pendse, S. R. & Kumar, N. Benefits and harms of large language models in digital mental health.
arXiv preprint arXiv:2311.14693 (2023).
213. Hua, Y. et al. Large language models in mental health care: a scoping review. arXiv preprint arXiv:2401.02984 (2024).
214. Robinson, N., Connolly, J., Suddrey, G. & Kavanagh, D. J. A brief wellbeing training session delivered by a humanoid
social robot: A pilot randomized controlled trial. arXiv e-prints arXiv–2308 (2023).
215. Lai, T. et al. Psy-llm: Scaling up global mental health psychological services with ai-based large language models. arXiv
preprint arXiv:2307.11991 (2023).
216. Ma, Z., Mei, Y. & Su, Z. Understanding the benefits and challenges of using large language model-based conversational
agents for mental well-being support. In AMIA Annual Symposium Proceedings, vol. 2023, 1105 (2023).
217. Chung, N. C., Dyer, G. & Brocki, L. Challenges of large language models for mental health counseling. arXiv preprint
arXiv:2311.13857 (2023).
218. Wang, J., Yang, Z., Yao, Z. & Yu, H. Jmlr: Joint medical llm and retrieval training for enhancing reasoning and
professional question answering capability. arXiv preprint arXiv:2402.17887 (2024).
219. Stokel-Walker, C. Chatgpt listed as author on research papers: many scientists disapprove. Nature 613, 620–621 (2023).
220. Shen, X., Chen, Z., Backes, M., Shen, Y. & Zhang, Y. "Do anything now": Characterizing and evaluating in-the-wild
jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825 (2023).
221. Umapathi, L. K., Pal, A. & Sankarasubbu, M. Med-halt: Medical domain hallucination test for large language models.
arXiv preprint arXiv:2307.15343 (2023).
222. Roit, P. et al. Factually consistent summarization via reinforcement learning with textual entailment feedback. arXiv
preprint arXiv:2306.00186 (2023).
223. Chern, I.-C. et al. Improving factuality of abstractive summarization via contrastive reward learning. arXiv preprint
arXiv:2307.04507 (2023).
224. Manakul, P., Liusie, A. & Gales, M. J. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large
language models. arXiv preprint arXiv:2303.08896 (2023).
225. Shuster, K., Poff, S., Chen, M., Kiela, D. & Weston, J. Retrieval augmentation reduces hallucination in conversation.
arXiv preprint arXiv:2104.07567 (2021).
226. Dhuliawala, S. et al. Chain-of-verification reduces hallucination in large language models. arXiv preprint
arXiv:2309.11495 (2023).
227. Lin, S., Hilton, J. & Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint
arXiv:2109.07958 (2021).
228. Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y. & Wen, J.-R. Halueval: A large-scale hallucination evaluation benchmark for
large language models. arXiv e-prints arXiv–2305 (2023).
229. Liu, F. et al. Auto-encoding knowledge graph for unsupervised medical report generation. In Advances in Neural
Information Processing Systems (2021).
230. Shumailov, I. et al. Model dementia: Generated data makes models forget. arXiv preprint arXiv:2305.17493 (2023).
231. Hoelscher-Obermaier, J., Persson, J., Kran, E., Konstas, I. & Barez, F. Detecting edit failures in large language models:
An improved specificity benchmark. arXiv preprint arXiv:2305.17553 (2023).
232. Liu, F. et al. A medical multimodal large language model for future pandemics. npj Digit. Medicine 6, 226 (2023).
233. Yao, Y. et al. Editing large language models: Problems, methods, and opportunities. arXiv preprint arXiv:2305.13172
(2023).
234. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 33,
9459–9474 (2020).
235. Hendrycks, D. et al. Aligning ai with shared human values. arXiv preprint arXiv:2008.02275 (2020).
236. Glaese, A. et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375
(2022).
237. Nakano, R. et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332
(2021).
238. Liu, H., Sferrazza, C. & Abbeel, P. Chain of hindsight aligns language models with feedback. arXiv preprint
arXiv:2302.02676 (2023).
239. Sallam, M. Chatgpt utility in healthcare education, research, and practice: systematic review on the promising perspectives
and valid concerns. In Healthcare, 887 (MDPI, 2023).
240. Tian, S. et al. Opportunities and challenges for chatgpt and large language models in biomedicine and health. Briefings
Bioinforma. 25, bbad493 (2024).
241. Li, H., Guo, D., Fan, W., Xu, M. & Song, Y. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint
arXiv:2304.05197 (2023).
242. Wei, A., Haghtalab, N. & Steinhardt, J. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483
(2023).
243. Meskó, B. & Topol, E. J. The imperative for regulatory oversight of large language models (or generative ai) in healthcare.
NPJ digital medicine 6, 120 (2023).
244. Derraz, B. et al. New regulatory thinking is needed for ai-based personalised drug and cell therapies in precision oncology.
NPJ Precis. Oncol. 8, 23 (2024).
245. Hacker, P., Engel, A. & Mauer, M. Regulating chatgpt and other large generative ai models. In Proceedings of the 2023
ACM Conference on Fairness, Accountability, and Transparency, 1112–1123 (2023).
246. Mökander, J., Schuett, J., Kirk, H. R. & Floridi, L. Auditing large language models: a three-layered approach. AI Ethics
1–31 (2023).
247. Chen, Q. et al. An extensive benchmark study on biomedical text generation and mining with chatgpt. Bioinformatics 39,
btad557 (2023).
248. Chen, Q. et al. Large language models in biomedical natural language processing: benchmarks, baselines, and recommen-
dations. arXiv preprint arXiv:2305.16326 (2023).
249. Yin, S. et al. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023).
250. Tu, T. et al. Towards generalist biomedical ai. arXiv preprint arXiv:2307.14334 (2023).
251. Li, C. et al. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint
arXiv:2306.00890 (2023).
252. Shu, C., Liu, F. & Shareghi, E. Visual med-alpaca: A parameter-efficient biomedical llm with visual capabilities.
https://github.com/cambridgeltl/visual-med-alpaca (2023).
253. Moor, M. et al. Med-flamingo: a multimodal medical few-shot learner. arXiv preprint arXiv:2307.15189 (2023).
254. Liu, J. et al. Qilin-med-vl: Towards chinese large vision-language model for general healthcare. arXiv preprint
arXiv:2310.17956 (2023).
255. Huang, H. et al. Chatgpt for shaping the future of dentistry: the potential of multi-modal large language model. Int. J.
Oral Sci. 15, 29 (2023).
256. Li, J., Liu, C., Cheng, S., Arcucci, R. & Hong, S. Frozen language model helps ecg zero-shot learning. arXiv preprint
arXiv:2303.12311 (2023).
257. Englhardt, Z. et al. Exploring and characterizing large language models for embedded system development and debugging.
arXiv preprint arXiv:2307.03817 (2023).
258. Xi, Z. et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864
(2023).
259. Wang, L. et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432 (2023).
260. Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D. & Ghanem, B. Camel: Communicative agents for "mind"
exploration of large scale language model society. arXiv preprint arXiv:2303.17760 (2023).
261. Tang, X. et al. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint
arXiv:2311.10537 (2023).
262. World Health Organization. Physical activity (2022). Accessed: Aug 18, 2023.
263. Connor, M. & O’Neill, M. Large language models in sport science & medicine: Opportunities, risks and considerations.
arXiv preprint arXiv:2305.03851 (2023).
264. Mello, M. M. & Guha, N. Chatgpt and physicians’ malpractice risk. In JAMA Health Forum, e231938 (2023).
Acknowledgements
This work was supported in part by the Pandemic Sciences Institute at the University of Oxford; the National Institute for
Health Research (NIHR) Oxford Biomedical Research Centre (BRC); an NIHR Research Professorship; a Royal Academy
of Engineering Research Chair; the Wellcome Trust-funded VITAL project; UK Research and Innovation (UKRI); the
Engineering and Physical Sciences Research Council (EPSRC); and the InnoHK Hong Kong Centre for Cerebro-cardiovascular
Health Engineering (COCHE).
Author Contributions
FL, ZL, JL, and DC conceived the project. FL conceived and designed the study. HZ, FL, BG, XZ, and JH conducted the
literature review, performed data analysis, and drafted the manuscript. All authors contributed to the interpretation and final
manuscript preparation. All authors read and approved the final manuscript.
Competing Interests
The authors declare no competing interests.