A Survey of Large Language Models in Medicine: Progress, Application, and Challenge
{hongjian.zhou@cs,fenglin.liu@eng,david.clifton@eng}.ox.ac.uk,
[email protected], {jhuang90@ur,jluo@cs}.rochester.edu
ABSTRACT
Large language models (LLMs), such as ChatGPT, have received substantial attention due to their capabilities for understanding
and generating human language. While there has been a burgeoning trend in research focusing on the employment of LLMs in
supporting different medical tasks (e.g., enhancing clinical diagnostics and providing medical education), a comprehensive
review of these efforts, particularly their development, practical applications, and outcomes in medicine, remains scarce.
Therefore, this review aims to provide a detailed overview of the development and deployment of LLMs in medicine, including
the challenges and opportunities they face. In terms of development, we provide a detailed introduction to the principles
of existing medical LLMs, including their basic model structures, number of parameters, and sources and scales of data
used for model development. It serves as a guide for practitioners in developing medical LLMs tailored to their specific
needs. In terms of deployment, we offer a comparison of the performance of different LLMs across various medical tasks,
and further compare them with state-of-the-art lightweight models, aiming to provide a clear understanding of the distinct
advantages and limitations of LLMs in medicine. Overall, in this review, we address the following study questions: 1) What are the practices for developing medical LLMs? 2) How can the performance of medical LLMs be measured in clinical settings? 3) How have medical LLMs been employed in real-world practice? 4) What challenges arise from the use of medical LLMs? and 5) How can medical LLMs be more effectively developed and deployed? By answering these questions, this
review aims to provide insights into the opportunities and challenges of LLMs in medicine and serve as a practical resource
for constructing effective medical LLMs. We also maintain a regularly updated list of practical guides on medical LLMs at:
https://github.com/AI-in-Health/MedLLMsPracticalGuide
1 Introduction
The recently emerged general large language models (LLMs) 1,2 , such as PaLM 3 , LLaMA 4,5 , GPT-series 6,7 , and ChatGLM 8,9 ,
have advanced the state-of-the-art in various natural language processing (NLP) tasks, including text generation, text summa-
rization, and question answering. Inspired by these successes, several endeavors have been made to adapt general LLMs to the
medicine domain, leading to the emergence of medical LLMs 10,11 . For example, based on PaLM 3 and GPT-4 7 , MedPaLM-2 11
and MedPrompt 12 have achieved competitive accuracies of 86.5% and 90.2%, respectively, compared to human experts (87.0% 13 )
in the United States Medical Licensing Examination (USMLE) 14 . In particular, based on publicly available general LLMs
(e.g. LLaMA 4,5 ), a wide range of medical LLMs, including ChatDoctor 15 , MedAlpaca 16 , PMC-LLaMA 13 , BenTsao 17 , and
Clinical Camel 18 , have been introduced. As a result, medical LLMs have attracted growing research interest for assisting medical professionals in improving patient care 19,20 .
Although existing medical LLMs have achieved promising results, there are some key issues in their development and
application that need to be addressed. First, many of these models primarily focus on medical dialogue and medical question-
answering tasks, but their practical utility in clinical practice is often overlooked 19 . Recent research and reviews 19,21,22 have
begun to explore the potential of medical LLMs in different clinical scenarios, including Electronic Health Records (EHRs) 23 ,
[Figure 1 organizes the guide into the principles of medical LLMs (pre-training from scratch, fine-tuning general LLMs, and prompting general LLMs, together with their data and medical knowledge bases; Section 2), medical tasks (Section 3), clinical applications such as education, diagnosis support, generation, formatting and ICD coding, medical robotics, and medical language translation, and future directions (Section 6).]
Figure 1. An overview of the practical guides for medical large language models.
discharge summary generation 20 , health education 24 , and care planning 11 . However, they primarily focus on presenting clinical
applications of LLMs, especially online commercial LLMs like ChatGPT (including GPT-3.5 and GPT-4 7 ), without providing
practical guidelines for the development of medical LLMs. Moreover, they mainly rely on case studies with human evaluation of a small number of samples, and thus lack evaluation datasets for assessing model performance in clinical scenarios. Second, most existing medical LLMs report their performance mainly on answering medical questions, neglecting other biomedical tasks, such as medical language understanding and generation. These research gaps motivate the present work, which offers a comprehensive review of the development of LLMs and their applications in medicine. We aim to cover topics on
existing medical LLMs, various medical tasks, clinical applications, and arising challenges.
As shown in Figure 1, this review seeks to answer the following questions. Section 2: What are LLMs? How can medical
LLMs be effectively built? Section 3: How are the current medical LLMs evaluated? What capabilities do medical LLMs offer
beyond traditional models? Section 4: How should medical LLMs be applied in clinical settings? Section 5: What challenges
should be addressed when implementing medical LLMs in clinical practice? Section 6: How can we optimize the construction
of medical LLMs to enhance their applicability in clinical settings, ultimately contributing to medicine and creating a positive
societal impact?
For the first question, we analyze the foundational principles underpinning current medical LLMs, providing detailed
descriptions of their architecture, parameter scales, and the datasets used during their development. This exposition aims to
serve as a valuable resource for researchers and clinicians designing medical LLMs tailored to specific requirements, such as
computational constraints, data privacy concerns, and the integration of local knowledge bases. For the second question, we
evaluate the performance of medical LLMs across ten biomedical NLP tasks, encompassing both discriminative and generative
tasks. This comparative analysis elucidates how these models outperform traditional AI approaches, offering insights into
the specific capabilities that render LLMs effective in clinical environments. The third question, the practical deployment
of medical LLMs in clinical settings, is explored through the development of guidelines tailored for seven distinct clinical
application scenarios. This section outlines practical implementations, emphasizing specific functionalities of medical LLMs
that are leveraged in each scenario. The fourth question emphasizes addressing the challenges associated with the clinical
deployment of medical LLMs, such as the risk of generating factually inaccurate yet plausible outputs (hallucination), and
the ethical, legal, and safety implications. Citing recent studies, we argue for a comprehensive evaluation framework that
assesses the trustworthiness of medical LLMs to ensure their responsible and effective utilization in healthcare. For the last
question, we propose future research directions to advance the medical LLMs field. This includes fostering interdisciplinary
collaboration between AI specialists and medical professionals, advocating for a ’doctor-in-the-loop’ approach, and emphasizing
human-centered design principles.
By establishing robust training data, benchmarks, metrics, and deployment strategies through co-development efforts, we
aim to accelerate responsible and efficacious integration of medical LLMs into clinical practice. This study therefore seeks
to stimulate continued research and development in this interdisciplinary field, with the objective of realizing the profound
potential of medical LLMs in enhancing clinical practice and advancing medical science for the betterment of society.
BOX 1: Background of Large Language Models (LLMs)
The impressive performance of LLMs can be attributed to Transformer-based models, large-scale pre-training, and scaling laws.
Language Models A language model 25,26,27 is a probabilistic model that models the joint probability distribution of tokens (meaningful units of text, such as words, subwords, or morphemes) in a sequence, i.e., the probabilities of how words and phrases are used in sequences. Therefore, it can predict the likelihood of a sequence of tokens given the previous tokens, which can be used to predict the next token in a sequence or to generate new sequences.
The Transformer architecture The recurrent neural network (RNN) 28,26 has been widely used for language modeling by processing tokens sequentially and maintaining a vector, named the hidden state, that encodes the context of previous tokens. Nonetheless, sequential processing makes it unsuitable for parallel training and limits its ability to capture long-range dependencies, making it computationally expensive and hindering its learning ability for long sequences. The strength of the Transformer 29 lies in its fully attentive mechanism, which relies exclusively on the attention mechanism and eliminates the need for recurrence. When processing each token, the attention mechanism computes a weighted sum of the other input tokens, where the weights are determined by the relevance between each input token and the current token. It allows the model to adaptively focus on different parts of the sequence to effectively learn the joint probability distribution of tokens. Therefore, the Transformer not only enables efficient modeling of long text but also allows highly parallel training 30 , thus reducing training costs. These properties make the Transformer highly scalable, and therefore it is efficient to obtain LLMs through the large-scale pre-training strategy.
Large-scale Pre-training LLMs are trained on massive corpora of unlabeled texts (e.g., CommonCrawl, Wiki, and Books) to learn rich linguistic knowledge and language patterns. The common training objectives are masked language modeling (MLM) and next token prediction (NTP). In MLM, a portion of the input text is masked, and the model is tasked with predicting the masked text based on the remaining unmasked context, encouraging the model to capture the semantic and syntactic relationships between tokens 30 . In NTP, the model is required to predict the next token in a sequence given the previous tokens 6 .
Scaling Laws LLMs are scaled-up versions of the Transformer architecture 29 with increased numbers of Transformer layers, model parameters, and volume of pre-training data. The “scaling laws” 31,32 predict how much improvement can be expected in a language model’s performance as its size increases (in terms of parameters, layers, data, or the amount of training compute). The scaling laws proposed by OpenAI 31 show that to achieve optimal model performance, the budget allocation for model size should be larger than that for data. The scaling laws proposed by Google DeepMind 32 show that both model and data sizes should be increased in equal proportions. The scaling laws guide researchers in allocating resources and anticipating the benefits of scaling models.
General Large Language Models Existing general LLMs can be divided into three categories based on their architecture (Table 1).
Encoder-only LLMs, consisting of a stack of Transformer encoder layers, employ a bidirectional training strategy that allows them to integrate context from both the left and the right of a given token in the input sequence. This bi-directionality enables the models to achieve a deep understanding of the input sentences 30 . Therefore, encoder-only LLMs are particularly suitable for language understanding tasks (e.g., sentiment analysis, document classification) where the full context of the input is essential for accurate predictions. BERT 30 and DeBERTa 33 are representative encoder-only LLMs.
Decoder-only LLMs utilize a stack of Transformer decoder layers and are characterized by their uni-directional (left-to-right) processing of text, enabling them to generate language sequentially. This architecture is trained with the next token prediction objective, i.e., predicting the next token in a sequence given all the previous tokens. After training, decoder-only LLMs generate sequences autoregressively (i.e., token by token). Examples are the GPT-series developed by OpenAI 6,7 , the LLaMA-series developed by Meta 4,5 , and PaLM 3 and Bard (Gemini) 34 developed by Google. Based on the LLaMA model, Alpaca 35 is fine-tuned with 52k self-instructed data supervision. In addition, Baichuan 36 is trained on approximately 1.2 trillion tokens and supports bilingual communication in Chinese and English. These LLMs have been used successfully in language generation.
Encoder-decoder LLMs are designed to simultaneously process input sequences and generate output sequences. They consist of a stack of bidirectional Transformer encoder layers followed by a stack of unidirectional Transformer decoder layers. The encoder processes and understands the input sequences, while the decoder generates the output sequences 8,9,37 . Representative examples of encoder-decoder LLMs include Flan-T5 38 and ChatGLM 8,9 . Specifically, ChatGLM 8,9 has 6.2B parameters and is a conversational open-source LLM specially optimized for Chinese to support Chinese-English bilingual question answering.
Table 1. Summary of existing general (large) language models, their underlying structures, numbers of parameters,
and datasets used for model training. Column “# params” shows the number of parameters, M: million, B: billion.
Table 2. Summary of existing medical-domain LLMs, in terms of their model development, the number of parameters (#
params), the scale of pre-training/fine-tuning data, and the data source. M: million, B: billion.
| Model Development | Models | # Params | Data Scale | Data Source |
|---|---|---|---|---|
| Pre-training (Sec. 2.1) | BioBERT 49 | 110M | 18B tokens | PubMed 50 + PMC 51 |
| | PubMedBERT 52 | 110M/340M | 3.2B tokens | PubMed 50 + PMC 51 |
| | SciBERT 53 | 110M | 3.17B tokens | Literature 54 |
| | ClinicalBERT 55 | 110M | 112k clinical notes | MIMIC-III 56 |
| | BioM-ELECTRA 57 | 110M/335M | - | PubMed 50 |
| | BioMed-RoBERTa 58 | 125M | 7.55B tokens | S2ORC 59 |
| | BioLinkBERT 60 | 110M/340M | 21GB | PubMed 50 |
| | SciFive 61 | 220M/770M | - | PubMed 50 + PMC 51 |
| | ClinicalT5 62 | 220M/770M | 2M clinical notes | MIMIC-III 56 |
| | BlueBERT 63,64,65 | 110M/340M | >4.5B tokens | PubMed 50 + MIMIC-III 56 |
| | MedCPT 66 | 330M | 255M articles | PubMed 50 |
| | BioGPT 67 | 1.5B | 15M articles | PubMed 50 |
| | BioMedLM 68 | 2.7B | 110GB | Pile 69 |
| | OphGLM 70 | 6.2B | 20k dialogues | MedDialog 71 |
| Prompting (Sec. 2.3) | Chat-Orthopedist 103 (ChatGPT, RAG) | - | - | PubMed + Guidelines 104 + UpToDate 105 + DynaMed 106 |
| | QA-RAG 107 (ChatGPT, RAG) | - | - | FDA QA 107 |
| | Almanac 108 (ChatGPT, RAG & CoT) | - | - | Clinical QA 108 |
2.1 Pre-training
Pre-training typically involves training an LLM on a large corpus of medical texts, including both structured and unstructured
text, to learn the rich medical knowledge. The corpus may include EHRs 72 , clinical notes 23 , and medical literature 55 . In
particular, PubMed 50 , MIMIC-III clinical notes 56 , and PubMed Central (PMC) literature 51 , are three widely used medical
corpora for medical LLM pre-training. A single corpus or a combination of corpora may be used for pre-training. For example,
PubMedBERT 52 and ClinicalBERT are pre-trained on PubMed and MIMIC-III, respectively. In contrast, BlueBERT 63
combines both corpora for pre-training; BioBERT 49 is pre-trained on both PubMed and PMC. The University of Florida (UF)
Health EHRs are further introduced in pre-training GatorTron 23 and GatorTronGPT 72 . MEDITRON 91 is pre-trained on Clinical
Practice Guidelines (CPGs). The CPGs are used to guide healthcare practitioners and patients in making evidence-based
decisions about diagnosis, treatment, and management.
To meet the needs of the medical domain, pre-training medical LLMs typically involves refining the training objectives commonly used in general LLMs: masked language modeling, next sentence prediction, and next token prediction (please see Box 1 for an introduction to these pre-training objectives). For example, BERT-series models (e.g., BioBERT 49 ,
PubMedBERT 52 , ClinicalBERT 55 , and GatorTron 23 ) mainly adopt the masked language modeling and the next sentence
prediction for pre-training; GPT-series models (e.g., BioGPT 67 , and GatorTronGPT 72 ) mainly adopt the next token prediction
Figure 2. We adopt the data from Table 2 to demonstrate the development of model sizes for medical large language models
in different model architectures, i.e., BERT-like, ChatGLM/LLaMA-like, and GPT/PaLM-like.
for pre-training. It is worth mentioning that BERT-like Medical LLMs (e.g. BioBERT 49 , PubMedBERT 52 , Clinical BERT 55 )
are originally derived from the general domain BERT or RoBERTa models. To clarify the differences between different models,
in our Table 2, we only show the data source used to further construct medical LLMs. After pre-training, medical LLMs can
learn rich medical knowledge that can be leveraged to achieve strong performance on different medical tasks.
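As an illustrative sketch (not the exact setup of any surveyed model), domain-adaptive pre-training with the masked language modeling objective can be implemented with the Hugging Face transformers library as follows; the base checkpoint and the toy corpus are placeholders, and in practice the corpora above (e.g., PubMed abstracts or MIMIC-III notes) would be streamed from disk.

```python
# Minimal sketch: continued (domain-adaptive) pre-training of a BERT-style model
# with the masked language modeling (MLM) objective. The model name and corpus
# are illustrative placeholders, not the exact setup used by the surveyed models.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

model_name = "bert-base-uncased"          # starting checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Toy medical corpus; replace with PubMed/PMC/MIMIC-III text in practice.
corpus = Dataset.from_dict({"text": [
    "Metformin is a first-line agent for type 2 diabetes mellitus.",
    "The chest radiograph shows no evidence of focal consolidation.",
]})
tokenized = corpus.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                       remove_columns=["text"])

# 15% of tokens are masked; the model is trained to reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="med-mlm", per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```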
2.2 Fine-tuning
It is high-cost and time-consuming to train a medical LLM from scratch, due to its requirement of substantial (e.g. several
days or even weeks) computational power and manual labor. One solution is to fine-tune the general LLMs with medical
data, and researchers have proposed different fine-tuning methods 11,16,18 for learning domain-specific medical knowledge and
obtaining medical LLMs. Current fine-tuning methods include Supervised Fine-Tuning (SFT), Instruction Fine-Tuning (IFT),
and Parameter-Efficient Fine-Tuning (PEFT). The resulting fine-tuned medical LLMs are summarized in Table 2.
Supervised Fine-Tuning (SFT) leverages high-quality medical corpora, such as physician-patient conversations 15 , medical question-answering data 16 , and knowledge graphs 77,17 . The constructed SFT data serves as a continuation of the pre-training data to further pre-train the general LLMs with the same training objectives, e.g. next token prediction. SFT provides an
additional pre-training phase that allows the general LLMs to learn rich medical knowledge and align with the medical domain,
thus transforming them into specialized medical LLMs.
The diversity of SFT data enables the development of diverse medical LLMs trained on different types of medical corpora.
For example, DoctorGLM 73 and ChatDoctor 15 are obtained by fine-tuning the general LLMs ChatGLM 8,9 and LLaMA 4 on
the physician-patient dialogue data, respectively. MedAlpaca 16 , based on the general LLM Alpaca 35 , is fine-tuned using over 160,000 medical QA pairs sourced from diverse medical corpora. Clinical Camel 18 combines physician-patient conversations,
clinical literature, and medical QA pairs to refine the LLaMA-2 model 5 . In particular, Qilin-Med 77 and Zhongjing 86 are
obtained by incorporating the knowledge graph to perform fine-tuning on the Baichuan 36 and LLaMA 4 , respectively.
In summary, existing studies have demonstrated the efficacy of SFT in adapting general LLMs to the medical domain. They
show that SFT improves not only the model’s capability for understanding and generating medical text, but also its ability to
provide accurate clinical decision support 109 .
Instruction Fine-Tuning (IFT) constructs instruction-based training datasets 110,109,1 , which typically comprise instruction-
input-output triples, e.g. instruction-question-answer. The primary goal of IFT is to enhance the model’s ability to follow
various human/task instructions, align their outputs with the medical domain, and thereby produce a specialized medical LLM.
Thus, the main difference between SFT and IFT is that the former focuses primarily on injecting medical knowledge into a
general LLM through continued pre-training, thus improving its ability to understand the medical text and accurately predict
the next token. In contrast, IFT aims to improve the model’s instruction following ability and adjust its outputs to match the
given instructions, rather than accurately predicting the next token as in SFT 110 . As a result, SFT emphasizes the quantity
of training data, while IFT emphasizes their quality and diversity. Since IFT and SFT are both capable of improving model
performance, there have been some recent works 86,77,85 attempting to combine them for obtaining robust medical LLMs.
In other words, to enhance the performance of LLMs through IFT, it is essential to ensure that the training data for IFT are
of high quality and encompass a wide range of medical instructions and medical scenarios. To this end, MedPaLM-2 11 invited
qualified medical professionals to develop the instruction data for fine-tuning the general PaLM. BenTsao 17 and ChatGLM-
Med 111 constructed the knowledge-based instruction data from the knowledge graph. Zhongjing 86 further incorporated the
multi-turn dialogue as the instruction data to perform IFT. MedAlpaca 16 simultaneously incorporated the medical dialogues
and medical QA pairs for instruction fine-tuning.
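To make the SFT/IFT distinction concrete, the snippet below contrasts how a raw SFT sample and an instruction-input-output triple are typically serialized into training text; the field names and the Alpaca-style template are illustrative assumptions, not the exact formats used by the cited medical LLMs.

```python
# Illustrative serialization of training samples (formats are assumptions,
# not the exact templates of the cited medical LLMs).

# SFT: raw medical text is used as-is for continued next-token prediction.
sft_sample = (
    "Patient: I have had a persistent dry cough for two weeks.\n"
    "Doctor: Given the duration, we should rule out post-viral cough and asthma..."
)

# IFT: an instruction-input-output triple is rendered with a prompt template,
# and the loss is typically computed only on the response tokens.
ift_triple = {
    "instruction": "Answer the patient's question as a licensed physician.",
    "input": "What does an HbA1c of 8.2% indicate?",
    "output": "An HbA1c of 8.2% indicates poorly controlled blood glucose...",
}

PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def render_ift(example: dict) -> str:
    """Render an instruction-input-output triple into a single training string."""
    return PROMPT_TEMPLATE.format(**example) + example["output"]

print(render_ift(ift_triple))
```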
Parameter-Efficient Fine-Tuning (PEFT) aims to substantially reduce computational and memory requirements for fine-
tuning general LLMs. The main idea is to keep most of the parameters of the pre-trained LLM unchanged and fine-tune only a small subset of parameters (or a small number of additional parameters). Commonly used PEFT techniques include Low-Rank
Adaptation (LoRA) 112 , Prefix Tuning 113 , and Adapter Tuning 114,115 .
In contrast to fine-tuning full-rank weight matrices, 1) LoRA preserves the parameters of the original LLMs and only adds
trainable low-rank matrices into the self-attention module of each Transformer layer 112 . Therefore, LoRA can substantially
reduce the number of trainable parameters and improve the efficiency of fine-tuning, while still enabling the fine-tuned LLM to
effectively capture the characteristics of the target tasks. 2) Prefix Tuning takes a different approach from LoRA by adding a small
set of continuous task-specific vectors (i.e. “prefixes”) to the input of each Transformer layer 113,1 . These prefixes serve as
the additional context to guide the generation of the model without changing the original pre-trained parameter weights. 3)
Adapter Tuning involves introducing small neural network modules, known as adapters, into each Transformer layer of the
pre-trained LLMs 116 . These adapters are fine-tuned while keeping the original model parameters frozen 116 , thus allowing for
flexible and efficient fine-tuning. The number of trainable parameters introduced by adapters is relatively small, yet they enable
the LLMs to adapt to clinical scenarios or tasks effectively.
In general, PEFT is valuable for developing LLMs that meet unique needs in specific (e.g. medical) domains, due
to its ability to reduce computational demands while maintaining the model performance. For example, medical LLMs
DoctorGLM 73 , MedAlpaca 16 , Baize-Healthcare 82 , Zhongjing 86 , CPLLM 87 , and Clinical Camel 18 adopted the LoRA 112 to
perform parameter-efficient fine-tuning to efficiently align the general LLMs to the medical domain.
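The following is a minimal sketch of LoRA-based parameter-efficient fine-tuning with the Hugging Face peft library; the base checkpoint, target modules, and hyperparameters are illustrative assumptions rather than the settings used by the models above.

```python
# Minimal LoRA sketch with the peft library; base model and hyperparameters
# are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"   # placeholder base checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach LoRA to attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically <1% of parameters are trainable
# The wrapped model can then be fine-tuned with the usual training loop on
# medical data, while the original pre-trained weights remain frozen.
```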
2.3 Prompting
Fine-tuning considerably reduces computational costs compared to pre-training, but it requires further model training and
collections of high-quality datasets for fine-tuning, thus still consuming some computational resources and manual labor. In
contrast, the “prompting” methods efficiently align general LLMs (e.g. PaLM 3 ) to the medical domain (e.g., MedPaLM 10 ),
without training any model parameters. Popular prompting methods include In-Context Learning (ICL), Chain-of-Thought
(CoT) prompting, Prompt Tuning, and Retrieval-Augmented Generation (RAG).
In-Context Learning (ICL) directly provides instructions in the prompt so that the LLM can perform a task efficiently. In general, ICL involves four steps: task understanding, context learning, knowledge reasoning, and answer generation. First, the model must understand the specific requirements and goals of the task. Second, the model learns to interpret the contextual information related to the task from the augmented context. Third, the model uses its internal knowledge and reasoning capabilities to recognize the patterns and logic in the examples. Finally, the model generates the task-related answers. The advantage of ICL is that it does not require a large amount of labeled data for fine-tuning. Based on the number of input examples, ICL can be divided into three categories 117 : 1) One-shot Prompting, where a single example and the task description are provided; 2) Few-shot Prompting, where multiple examples and the task description are provided; and 3) Zero-shot Prompting, where only the task description is provided. ICL thus lets the LLM make task predictions based on a context augmented with a few examples and task demonstrations. The LLM learns from these examples or demonstrations to accurately perform the task and produce corresponding answers 6 . Therefore, ICL allows LLMs to accurately understand
and respond to medical queries. For example, MedPaLM 10 substantially improves the task performance by providing the
general LLM, PaLM 3 , with a small number of task examples such as medical QA pairs.
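A minimal few-shot ICL prompt might look like the following sketch; the exemplar questions and answers are invented placeholders, and the exact exemplar format used by MedPaLM is not reproduced here.

```python
# Illustrative few-shot in-context-learning prompt for medical QA
# (exemplars are invented placeholders, not items from a real benchmark).
examples = [
    ("Which electrolyte abnormality is most associated with peaked T waves?",
     "Hyperkalemia"),
    ("First-line pharmacotherapy for newly diagnosed type 2 diabetes?",
     "Metformin"),
]
query = "Which vitamin deficiency causes megaloblastic anemia with neurological signs?"

prompt = "Answer the following medical questions concisely.\n\n"
for q, a in examples:                      # task demonstrations (few-shot)
    prompt += f"Question: {q}\nAnswer: {a}\n\n"
prompt += f"Question: {query}\nAnswer:"    # the model completes this line

print(prompt)  # this string would be sent to the LLM's completion endpoint
```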
Chain-of-Thought (CoT) Prompting further improves the accuracy and logic of model output, compared with In-Context
Learning. Specifically, through prompting words, CoT aims to prompt the model to generate intermediate steps or paths of
reasoning when dealing with downstream (complex) problems 98 . Moreover, CoT can be combined with few-shot prompting by
giving reasoning examples, thus enabling medical LLMs to give reasoning processes when generating responses. For tasks
involving complex reasoning, such as medical QA, CoT has been shown to effectively improve model performance 10,11 . Medical
LLMs, such as DeID-GPT 99 , MedPaLM 10 , and MedPrompt 12 , use CoT prompting to assist them in simulating a diagnostic
thought process, thus providing more transparent and interpretable predictions or diagnoses. In particular, MedPrompt 12
directly prompts a general LLM, GPT-4 7 , to outperform the fine-tuned medical LLMs on medical QA without training any
model parameters.
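In practice, CoT prompting is often just an extra exemplar or instruction that elicits intermediate reasoning before the final answer, as in this illustrative sketch (the clinical content is invented for demonstration):

```python
# Illustrative chain-of-thought prompt: a worked exemplar plus a reasoning
# trigger elicit intermediate steps before the final answer (content invented).
cot_prompt = (
    "Question: A patient on lisinopril develops a persistent dry cough. "
    "What is the likely cause?\n"
    "Reasoning: Lisinopril is an ACE inhibitor. ACE inhibitors increase "
    "bradykinin, which commonly causes a dry cough.\n"
    "Answer: ACE-inhibitor (bradykinin-mediated) cough.\n\n"
    "Question: A 62-year-old has crushing chest pain and ST elevation in "
    "leads II, III, and aVF. Which coronary artery is most likely occluded?\n"
    "Reasoning:"   # the model is expected to reason step by step, then answer
)
print(cot_prompt)
```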
Prompt Tuning aims to improve the model performance by employing both prompting and fine-tuning techniques 118,115 . The
prompt tuning method introduces learnable prompts, i.e. trainable continuous vectors, which can be optimized or adjusted
during the fine-tuning process to better adapt to different medical scenarios and tasks. Therefore, they provide a more flexible
way of prompting LLMs than the “prompting alone” methods that use discrete and fixed prompts, as described above. In contrast
to traditional fine-tuning methods that train all model parameters, prompt tuning only tunes a very small set of parameters
associated with the prompts themselves, instead of extensively training the model parameters. Thus, prompt tuning effectively
and accurately responds to medical problems 12 , while incurring minimal computational cost.
Existing medical LLMs that employ the prompting techniques are listed in Table 2. Recently, MedPaLM 10 and MedPaLM-2 11 proposed to combine all the above prompting methods, resulting in Instruction Prompt Tuning, which achieves strong performance
on various medical question-answering datasets. In particular, using the MedQA dataset for the US Medical Licensing
Examination (USMLE), MedPaLM-2 11 achieves a competitive overall accuracy of 86.5% compared to human experts (87.0%),
surpassing previous state-of-the-art method MedPaLM 10 by a large margin (19%).
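For illustration, a minimal soft prompt tuning sketch with the Hugging Face peft library is shown below; the base model, number of virtual tokens, and initialization text are assumptions for demonstration and do not reproduce the configuration of MedPaLM's instruction prompt tuning.

```python
# Minimal soft prompt tuning sketch with peft; model and settings are assumptions.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model

config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,                      # learnable continuous prompt vectors
    prompt_tuning_init=PromptTuningInit.TEXT,   # initialize from a text prompt
    prompt_tuning_init_text="Answer the medical question:",
    tokenizer_name_or_path="gpt2",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the virtual-token embeddings are trainable
```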
Retrieval-Augmented Generation (RAG) enhances the performance of LLMs by integrating external knowledge into the
generation process. In detail, RAG can be used to minimize LLM’s hallucinations, obscure reasoning processes, and reliance
on outdated information by incorporating external database knowledge 119 . RAG consists of three main components: retrieval,
augmentation, and generation. The retrieval component employs various indexing strategies and query processing techniques to search for and rank relevant information from an external knowledge base. The retrieved external data is then
augmented into the LLM’s prompt, providing additional context and grounding for the generated response. By directly updating
the external knowledge base, RAG mitigates the risk of catastrophic forgetting associated with model weight modifications,
making it particularly suitable for domains with low error tolerance and rapidly evolving information, such as the medical
field. In contrast to traditional fine-tuning methods, RAG enables the timely incorporation of new medical information without
compromising the model’s previously acquired knowledge, ensuring the generated outputs remain accurate and up-to-date
in the face of evolving medical challenges. Most recently, researchers proposed MIRAGE 120 , the first benchmark for medical information RAG, which includes 7,663 questions from five medical QA datasets and has been established to both steer research and facilitate the practical deployment of medical RAG systems.
In RAG, retrieval can be achieved by calculating the similarity between the embeddings of the question and document
chunks, where the semantic representation capability of embedding models plays a key role. Recent research has introduced
prominent embedding models such as AngIE 121 , Voyage 122 , and BGE 123 . In addition to embedding, the retrieval process
can be optimized via various strategies such as adaptive retrieval, recursive retrieval, and iterative retrieval 124,125,126 . Several
recent works have demonstrated the effectiveness of RAG in medical and pharmaceutical domains. Almanac 108 is a large
language framework augmented with retrieval capabilities for medical guidelines and treatment recommendations, surpassing
the performance of ChatGPT on clinical scenario evaluations, particularly in terms of completeness and safety. Another work
QA-RAG 107 employs RAG with LLM for pharmaceutical regulatory tasks, where the model searches for relevant guideline
documents and provides answers based on the retrieved guidelines. Chat-Orthopedist 103 , a retrieval-augmented LLM, assists
adolescent idiopathic scoliosis (AIS) patients and their families in preparing for meaningful discussions with clinicians by
providing accurate and comprehensible responses to patient inquiries, leveraging AIS domain knowledge.
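A minimal sketch of the retrieve-augment-generate loop is shown below, using sentence-transformers embeddings and cosine similarity for retrieval; the embedding model, the toy document store, and the prompt format are assumptions, not components of the cited systems.

```python
# Minimal RAG sketch: embed the query, retrieve top-k similar chunks by cosine
# similarity, and prepend them to the prompt. All components are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model

documents = [  # toy external knowledge base; in practice, guidelines or PubMed chunks
    "Warfarin requires regular INR monitoring; the target INR is usually 2-3.",
    "Metformin is contraindicated in severe renal impairment (eGFR < 30).",
    "Peaked T waves on ECG are an early sign of hyperkalemia.",
]
doc_emb = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k document chunks most similar to the query."""
    q_emb = embedder.encode([query], normalize_embeddings=True)
    scores = np.dot(doc_emb, q_emb[0])          # cosine similarity (normalized)
    return [documents[i] for i in np.argsort(-scores)[:k]]

query = "Can I prescribe metformin to a patient with an eGFR of 25?"
context = "\n".join(retrieve(query))

prompt = (f"Use only the context below to answer.\n\nContext:\n{context}\n\n"
          f"Question: {query}\nAnswer:")
# `prompt` is then passed to the LLM so the response stays grounded in retrieved text.
```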
2.4 Discussion
This section discusses the principles of medical LLMs, including three types of methods for building models: pre-training,
fine-tuning, and prompting. To meet the needs of practical medical applications, users can choose proper medical LLMs
according to the magnitude of their own computing resources. Companies or institutes with massive computing resources
can either pre-train an application-level medical LLM from scratch or fine-tune existing open-source general LLM models
(e.g. LLaMA 43 ) using large-scale medical data. The results in existing literature (e.g. Med-PaLM2 11 , MedAlpaca 16 and
Clinical Camel 18 ) have shown that fine-tuning general LLMs on medical data can boost their performance on medical tasks. For example, Clinical Camel 18 , which is fine-tuned from the LLaMA-2-70B 5 model, even outperforms GPT-4 18 . However, small enterprises or individuals with limited computing resources can combine an understanding of medical tasks with a reasonable mix of ICL, prompt engineering, and RAG to prompt black-box LLMs, which may also yield strong results. For example, MedPrompt 12 steers the commercial LLM GPT-4 7 through an appropriate combination of prompting strategies to achieve comparable or even better results than fine-tuned medical LLMs (e.g. Med-PaLM2 11 ) and human experts, suggesting that a mix of prompting strategies can be a more efficient and resource-friendly solution in the medical domain than fine-tuning.
3 Medical Tasks
In this section, we will introduce two popular types of medical machine learning tasks: generative and discriminative tasks,
including ten representative tasks that further build up clinical applications. Figure 3 illustrates the performance comparisons
[Figure 3 axes: USMLE, PubMedQA, and MedMCQA (question answering, accuracy); TREC-COVID and NFCorpus (information retrieval, NDCG@10); BIOSSES (semantic textual similarity, F1); NCBI Disease and BC5CDR Drug/Chem. (entity extraction, F1); DDI (relation extraction, F1); MIMIC-III (text classification, F1); MedNLI (natural language inference, F1).]
Figure 3. Performance (Dataset-Metric (Task)) comparison between the GPT-3.5 turbo, GPT-4, state-of-the-art task-specific
lightweight models (Fine-tuned), and human experts, on seven medical tasks across eleven datasets. All data presented in our
Figures originates from published and peer-reviewed literature. Please refer to the supplementary material for the detailed data.
between different LLMs. For clarity, we will only cover a general discussion of those tasks. The detailed definition of the task
and the performance comparisons can be found in our supplementary material.
Figure 4. We demonstrate the development of medical large language models over time in different model development types
through the scores of the United States Medical Licensing Examination (USMLE) from the MedQA dataset.
fine-tuned model BioBERT 49 achieves an F1 score of 89.36, substantially exceeding the F1 score of 56.73 by GPT-4. We
hypothesize that the reason for the strong QA capability of the current general LLMs is that the QA task is close-ended; i.e. the
correct answer is already provided by multiple candidates. In contrast, most non-QA tasks are open-ended where the model has
to predict the correct answer from a large pool of possible candidates, or even without any candidates provided.
Overall, the comparison shows that current general LLMs have a strong question-answering capability; however, their capability on other tasks still needs to be improved. Therefore, we advocate that the evaluation of medical LLMs should be
extended to a broad range of tasks including non-QA tasks, instead of being limited mainly to medical QA tasks. Hereafter, we
will discuss specific clinical applications of LLMs, followed by their challenges and future directions.
4 Clinical Applications
As shown in Figure 5, this section discusses the clinical applications of LLMs. Each subsection contains a specific application
and discusses how LLMs perform this task. Table 3 summarizes the guidelines on how to select, build, and evaluate medical
LLMs for various clinical applications.
[Figure 5 panels illustrate clinical applications such as ultrasound scanning support, a multi-agent LLM planner for surgery, translation to other languages, and translation to lay language.]
Figure 5. Integrated overview of potential applications 101,136,137,138,139 of large language models in medicine.
Table 3. Summary of existing medical LLMs tailored to various clinical applications, in terms of their architecture, model
development, the number of parameters, the scale of PT/FT data, and the data source. M: million, B: billion. PT: Pre-training.
FT: Fine-tuning. ICL: In-Context Learning. CoT: Chain-of-Thought prompting. RAG: Retrieval-Augmented Generation.
| Application | Model | Architecture | Model Development | # Params | Data Scale | Data Source |
|---|---|---|---|---|---|---|
| - | Dr. Knows 101 | GPT-3.5 | ICL | 154B | 5820 notes | MIMIC-III 56 + IN-HOUSE 101 |
| - | DDx PaLM-2 140 | PaLM-2 | FT & ICL | 340B | - | MultiMedQA 11 + MIMIC-III 56 |
VQA-RAD (radiology) 142 , SLAKE (radiology) 145 , and PathVQA (pathology) 143 are frequently employed. Most benchmarking
efforts involve both quantitative evaluation metrics and human evaluations. These models have demonstrated their effectiveness
and the potential for substantial improvements in medical diagnosis tasks.
Discussion One distinct limitation of using LLMs as the sole tool for medical diagnosis is the heavy reliance on subjective
text inputs from the patient. Since LLMs are text-based, they lack the inherent capability to analyze medical diagnostic
imagery. Given that objective medical diagnoses frequently depend on visual images, LLMs are often unable to directly conduct
diagnostic assessments as they lack concrete visual evidence to support disease diagnosis 185 . However, they can help with
diagnosis as a logical reasoning tool for improving accuracy in other vision-based models. One such example is ChatCAD 100 ,
where images are first fed into an existing computer-aided diagnosis (CAD) model to obtain tensor outputs. These outputs
are translated into natural language, which is subsequently fed into ChatCAD to summarize results and formulate diagnoses.
ChatCAD achieves a recall score of 0.781, substantially higher than that (0.382) of the state-of-the-art task-specific model.
Nevertheless, none of the aforementioned LLM-based approaches can directly process images; instead, they either transform images into text beforehand or rely on a separate external vision encoder to embed images.
4.2 Formatting and ICD-Coding
The international classification of diseases (ICD) 128 is a method of standardizing diagnostic and procedural information of a
clinical session. ICD codes are recorded in the patient’s EHRs at every doctor visit. They are also used for tracking health metrics, treatment outcomes, and billing. There is a need to automate ICD labeling because manual entry is very time-consuming for doctors. Formatting and ICD coding usually involve entity extraction, relation extraction, text
generation, and information retrieval. LLMs can help automate ICD coding by extracting medical terms from clinical notes and
assigning corresponding ICD codes 186,136 .
Guideline For example, PLM-ICD 146 builds upon the RoBERTa model 39 , fine-tuning it specifically for ICD coding and
achieving strong performance on 70,539 notes from the MIMIC-II and MIMIC-III datasets 56 , as evaluated by accuracy. The
base model used in PLM-ICD is domain-specific, with medicine-specific knowledge that enhances its ability to understand medical terms. PLM-ICD uses segment pooling, an algorithm that divides long input texts into shorter segments that are encoded separately when the input surpasses the maximum allowable length. Lastly, it relates the encodings to the augmented labels to output ICD
codes for each clinical input. PLM-ICD produced a higher AUC score than previous state-of-the-art lightweight models 146 .
DRG-LLaMA 148 leverages the LLaMA model and applies parameter-efficient fine-tuning techniques, such as LoRA, to adapt
the model to this task. ChatICD 150 and LLM-codex 151 both utilize the ChatGPT model with prompts for ICD coding. However,
LLM-codex 151 takes this a step further by training an LSTM model on top of the ChatGPT responses, demonstrating its
strong performance. ICD coding can be formulated as a multi-label classification task, and most work in this area utilizes the
MIMIC-III dataset for training and evaluation. Models are typically assessed based on their F1 score, AUC, and Precision@k,
considering either the top 50 most frequent labels or the full label set.
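For reference, Precision@k for multi-label ICD coding can be computed as in the brief sketch below; the label indices and scores are toy placeholders.

```python
# Toy sketch of Precision@k for multi-label ICD coding: the fraction of the
# k highest-scoring predicted codes that are in the gold label set.
import numpy as np

def precision_at_k(scores: np.ndarray, gold: set[int], k: int) -> float:
    """scores: per-code prediction scores; gold: indices of true ICD codes."""
    top_k = np.argsort(-scores)[:k]
    return sum(1 for idx in top_k if idx in gold) / k

# Example: 6 candidate codes, gold codes are indices {0, 3}.
scores = np.array([0.91, 0.12, 0.40, 0.85, 0.05, 0.33])
print(precision_at_k(scores, gold={0, 3}, k=2))   # -> 1.0 (both top-2 codes correct)
```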
Discussion One challenge while deploying LLMs for clinical coding is the potential biases and hallucinations. In particular,
traditional multi-label classification models can easily constrain their outputs to a predefined list of (usually >1000) ICD
candidate codes through a classification neural network. In contrast, generative LLMs can suffer from major hallucinations
when the input text is lengthy. As a result, the LLM may assign an ICD code that is not in the candidate list, or even a non-existent ICD code, to the input text. This leads to confusion when interpreting medical records 23 ; it is therefore crucial to establish a
proactive mechanism to detect and rectify errors before they enter patient EHRs.
generation primarily focus on lexical metrics, which can lead to biased and inaccurate assessments of the contextual information
present in the reports 197 . For instance, consider two sentences with similar meanings but different wordings: “The patient’s
blood glucose level is within normal limits” and “The patient does not exhibit signs of hyperglycemia”. While both convey the
absence of hyperglycemia, lexical evaluation metrics may struggle to accurately capture their semantic equivalence, as they
rely on direct word-level comparisons. This discrepancy highlights the need for more sophisticated evaluation techniques that
can account for the nuances and variations in expressing clinical information. Developing evaluation methods that go beyond
surface-level similarities and consider the underlying medical context is crucial for ensuring the reliability and usefulness of
LLMs in generating clinical reports.
Guideline Medical mT5 161 , Apollo 164 , and BiMediX 165 are multilingual large language models in the medical domain.
Medical mT5 161 , which is based on multilingual T5 (mT5) with 738 million / 3 billion parameters, is trained on 4.5 billion
tokens spanning various languages, i.e., English, French, Italian, and Spanish. Apollo 164 supports English,
Chinese, French, Spanish, Arabic, and Hindi based on the Qwen model at various relatively small sizes (i.e., 0.5B, 1.8B,
2B, 6B, and 7B), achieving the best performance among models of equivalent size. BiMediX 165 is a bilingual medical
mixture-of-experts language model for English and Arabic, proposing a semi-automated English-to-Arabic translation pipeline
with human refinement for high-quality translations. For medical translation to lay language, a work aims to enhance the
performance of language models in biomedical abstractive summarisation by aggregating knowledge from external papers cited
within the source article 166 . It proposes a novel attention-based citation aggregation model that integrates domain-specific
knowledge from citation papers, allowing neural networks to generate summaries by leveraging both the paper content and
relevant knowledge from citation papers 166 . Another work introduces Retrieval-Augmented Lay Language (RALL) generation
with a large and broad-ranging 63k lay language generation pairs from 12 journals, intuitively fitting the need for external
knowledge beyond expert-authored source documents 167 . It also evaluates the ability of both an open-source LLaMA-2 and
closed-source GPT-4 in background explanation, with and without retrieval augmentation.
Discussion In both translation and simplification tasks, misinterpretation is a common occurrence that can have damaging
consequences. In developing and deploying medical translation and simplification platforms, developers should prioritize
professional datasets, such as textbooks and peer-reviewed journals for medical knowledge recall. This way, it will be less likely
for misinformation from unreliable web sources to skew the output 209 . Another ethical consideration of using LLMs to perform
medical translation is the potential for discriminatory verbiage to be inserted inadvertently into the output. Such verbiage is
difficult to prevent due to the nature of the pipeline. This may cause miscommunication and even have legal consequences 207 .
4.7 Mental Health Support
Mental health support involves both diagnosis and treatment. For example, depression is treated through a variety of
psychotherapies, including cognitive behavior therapy, interpersonal psychotherapy, psychodynamic therapy, etc. 139 . Many of
these techniques are primarily dominated by patient-doctor conversations, with lengthy treatment plans that are cost-prohibitive
for many. The ability of LLMs to serve as conversation partners and companions may lower the barrier to entry for patients
with financial or physical constraints 211 , increasing the accessibility to mental health treatments 170 . There have been various
research works and discussions on the effects of incorporating LLMs into the treatment plan 170,212,213 .
The level of self-disclosure has a heavy impact on the effectiveness of mental health diagnosis and treatment. The degree of
willingness to share has a direct impact on the diagnosis results and treatment plan. Studies have shown that patient willingness
to discuss mental health-related topics with a robot is high 214,212 . Alongside the convenience and lower financial stakes, mental
health support by LLMs has the potential to be more effective than human counterparts in many scenarios.
Guideline Development and deployment of LLMs targeted at mental health support can start with an existing LLM. Instead
of pre-training or fine-tuning on general medical data, it is often better to use medical question and answer data as most of
the LLM’s work will be talking to the patient, which involves back-and-forth conversation in the format of question and
answering 215 . PsyChat 169 is a client-centric LLM dialogue system that provides psychological support comprising five
modules: client behavior recognition, counselor strategy selection, input packer, response generator, and response selection.
Specifically, the response generator is fine-tuned from ChatGLM-6B on a large dialogue dataset. Through both automatic and human evaluations, the system has demonstrated its effectiveness and practicality in real-life mental health support scenarios. ChatCounselor is designed to provide mental health support. It is initialized from Vicuna and fine-tuned on an 8k-sample instruction-tuning dataset collected from real-world counseling dialogue examples 170 . Psy-LLM is an LLM intended as an
assistive mental health tool to support the workflow of professional counselors, particularly to support those who might be
suffering from depression or anxiety 215 . Another work presents a comprehensive evaluation of prompt engineering, few-shot,
and fine-tuning techniques on multiple LLMs in the mental health domain 171 . The results reveal that fine-tuning on a variety of
datasets can improve LLM’s capability on multiple mental-health-specific tasks across different datasets simultaneously 171 .
The work also releases their model Mental-Alpaca and Mental-FLAN-T5 as open-source LLMs targeted at multiple mental
health prediction tasks 171 .
Discussion Two of the most critical difficulties in employing LLMs for mental health support are the lack of emotional
understanding and the risk of inappropriate or harmful responses 216 . LLMs, being language models, may struggle to fully
grasp and respond to the complex emotional states and needs of individuals seeking mental health support. They may not be
able to provide the same level of empathy and human connection that is crucial in therapeutic interactions.
Moreover, if not properly trained or controlled, LLMs may generate responses that are inappropriate, insensitive, or even
harmful to individuals in vulnerable emotional states 217 . They may provide advice that is not grounded in evidence-based
psychological practices or that goes against established mental health guidelines. Addressing these challenges requires rigorous
training of LLMs in evidence-based practices, ethical considerations, and risk assessment protocols, as well as collaboration
between mental health professionals and AI researchers.
Discussion However, these systems remain far from deployment in real-world healthcare; several challenges must be addressed before they can be widely used in clinical settings. One major concern is the potential for biased or
inaccurate outputs, which could lead to improper medical advice or misdiagnosis 210 . Rigorous testing and validation across
diverse patient populations and medical contexts are essential to ensure the reliability and generalizability of these systems.
Additionally, the integration of medical LLMs into existing healthcare workflows and infrastructure may require substantial
technical and organizational efforts. Privacy and security concerns surrounding patient data must also be carefully considered
and addressed.
Furthermore, the development and deployment of medical LLMs raise important ethical and responsible AI considerations.
Ensuring transparency, explainability, and accountability in the decision-making processes of these systems is crucial to
maintaining trust and facilitating informed consent from patients 219,220 . The potential impact on the doctor-patient relationship
and the role of human physicians in an AI-assisted healthcare setting must also be carefully examined. Ongoing collaboration
between AI researchers, healthcare professionals, ethicists, and policymakers will be necessary to establish guidelines and best
practices for the responsible development and deployment of medical LLMs in real-world healthcare settings.
5 Challenges
We discuss the challenges arising from the adoption of LLMs in an array of medical applications, along with potential solutions.
5.1 Hallucination
Hallucination of LLMs refers to the phenomenon where the generated output contains inaccurate or nonfactual information.
It can be categorized into intrinsic and extrinsic hallucinations 221,210 . Intrinsic hallucination generates outputs logically
contradicting factual information, such as wrong calculations of mathematical formulas 210 . Extrinsic hallucination happens
when the generated output cannot be verified; typical examples include LLMs ‘faking’ citations that do not exist or ‘dodging’
the question. When integrating LLMs into the medical domain, fluent but nonfactual LLM hallucinations can lead to the
dissemination of incorrect medical information, causing misdiagnoses, inappropriate treatments, and harmful patient education.
It is therefore vital to ensure the accuracy of LLM outputs in the medical domain.
Potential Solutions Current solutions to mitigate LLM hallucination can be categorized into training-time correction,
generation-time correction, and retrieval-augmented correction. The first (i.e. training-time correction) adjusts model parameter
weights, thus reducing the probability of generating hallucinated outputs. Its examples include factually consistent reinforcement
learning 222 and contrastive learning 223 . The second (i.e. generation-time correction) adds a ‘reasoning’ process to the LLM
inference to ensure reliability, for example by drawing multiple samples 224 or using a confidence score to identify hallucination before the final
generation. The third approach (i.e. retrieval-augmented correction) utilizes external resources to mitigate hallucination, for
example, using factual documents as prompts 225 or chain-of-retrieval prompting technique 226 .
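As one concrete generation-time example, sampling-based consistency checking flags answers on which multiple stochastic generations disagree; a toy sketch is given below, where `generate` is a hypothetical stand-in for any LLM sampling call rather than a specific API.

```python
# Toy generation-time consistency check: sample several answers and flag the
# output as potentially hallucinated when the samples disagree.
# `generate` is a hypothetical stand-in for any LLM sampling call.
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError("replace with a real LLM call")

def consistent_answer(prompt: str, n_samples: int = 5, min_agreement: float = 0.6):
    samples = [generate(prompt) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    if count / n_samples < min_agreement:       # low agreement -> likely unreliable
        return None                             # abstain / route to human review
    return answer
```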
Potential Solutions Current state-of-the-art approaches 11,15 typically fine-tune the LLMs on smaller open-sourced datasets
to improve their domain-specific performance. Another solution is to generate high-quality synthetic datasets using LLMs to
broaden the knowledge coverage; however, it has been discovered that training on generated datasets causes models to forget 230 .
Future research is needed to validate the effectiveness of using synthetic data for LLMs in the medical field.
[Figure 6 pairs development directions with deployment directions, including clinical skill evaluation; integration of visual, audio, and language data; specialized role modeling with LLMs; potential in sports medicine; and real-world testing and evaluation.]
Figure 6. Future directions of LLMs in clinical medicine in terms of both development and deployment.
regulators to develop adaptable, foresightful frameworks to ensure the safety, ethical standards, and privacy of the new family
of LLMs-powered medical technologies.
Potential Solutions To address the complex regulatory challenges without hindering innovation, regulators should devise
adaptive, flexible, and robust frameworks. Drawing on the insights from Mesko and Topol 243 , creating a dedicated regulatory
category and implementing patient design to enhance decision-making for LLMs used for medical purposes can better address
their unique attributes and minimize harm. Furthermore, the insights outlined by Derraz et al. 244 emphasize the importance of
implementing agile regulatory frameworks that can keep pace with the fast-paced advancements in personalized applications.
Researchers both inside 243,244 and outside of healthcare 245,246 have proposed innovative strategies to regulate the use of LLMs
involving (i) assessing LLMs-enabled applications in real-world settings, (ii) obligations of transparency of data and algorithms,
(iii) adaptive risk assessment and mitigation processes, (iv) continuous testing and refinement of audited technologies. Such
proactive regulatory adaptations are crucial to maintaining high standards of safety, ethics, and trustworthiness of medical
technology.
6 Future Directions
Although LLMs have already made an impact on people’s lives through chatbots and search engines, their integration into
medicine is still in its infancy. As shown in Figure 6, numerous new avenues for medical LLMs await exploration by researchers and practitioners seeking to better serve patients and the general public.
included in traditional benchmarks, (iii) physician-in-the-loop benchmarks to evaluate the performance of LLMs leveraging
their human counterparts or users.
6.2 Multimodal LLM Integrated with Time-Series, Visual, and Audio Data
Multimodal LLMs (MLLMs), or Large Multimodal Models (LMMs), are LLM-based models designed to perform multimodal
(e.g. involving both visual and textual) tasks 249 . While LLMs primarily address NLP tasks, MLLMs support a broader range of
tasks, such as comprehending the underlying meaning of a meme and generating website codes from images. This versatility
suggests promising applications of MLLMs in medicine. Several MLLM-based frameworks integrating vision and language, e.g.
MedPaLM M 250 , LLaVA-Med 251 , Visual Med-Alpaca 252 , Med-Flamingo 253 , and Qilin-Med-VL 254 , have been proposed to
adopt the medical image-text pairs for fine-tuning, thus enabling the medical LLMs to efficiently understand the input medical
(e.g. radiology) images. A recent study 255 proposes to integrate vision, audio, and language inputs for automated diagnosis
in dentistry. However, only very few medical LLMs can process time-series data, such as electrocardiograms (ECGs) 256 and photoplethysmograms (PPGs) 257 , despite such data being important for medical diagnosis and monitoring.
Although early in their proposed research stages, these studies suggest that MLLMs trained at scale have the potential to
effectively generalize across various domains and modalities outside of NLP tasks. However, the training of MLLMs at scale is
still costly and inefficient, resulting in MLLMs being much smaller than LLMs. Moving forward, future research
may focus on (i) more effective processing, representation, and learning of multi-modal data and knowledge, (ii) cost-effective
training of MLLMs, especially modalities that are more resource-demanding such as videos and images, (iii) collecting or
accessing safely, currently unavailable, multi-modal data in medicine and healthcare.
medicine. The medical community has primarily adopted LLMs provided by technology companies without rigorously
questioning their data training, ethical protocols, or privacy protection. Medical professionals are therefore encouraged to
actively participate in creating and deploying medical LLMs by providing relevant training data, defining the desired benefits of
LLMs, and conducting tests in real-world scenarios to evaluate these benefits 19,21,22 . Such assessments would help to determine
the legal and medical risks associated with LLM use in medicine and inform strategies to mitigate LLM hallucination 264 .
Additionally, training ‘bilingual’ professionals—those versed in both medicine and LLM technology—is increasingly vital
due to the rapid integration of LLMs into healthcare. Future research may explore (i) interdisciplinary frameworks, such as
frameworks that facilitate the sharing of localized data from rural clinics, (ii) 'bilingual' education programs that offer training
in both AI and medicine, and (iii) effective in-house development methods that help hospitals and physicians 'guard'
patient data from corporations while still embracing innovation.
References
1. Zhao, W. X. et al. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
2. Yang, J. et al. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712
(2023).
3. Chowdhery, A. et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
4. Touvron, H. et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
5. Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
6. Brown, T. et al. Language models are few-shot learners. Adv. neural information processing systems 33, 1877–1901
(2020).
7. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
8. Du, Z. et al. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics, 320–335 (2022).
9. Zeng, A. et al. Glm-130b: An open bilingual pre-trained model. In International Conference on Learning Representations
(2022).
10. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
11. Singhal, K. et al. Towards expert-level medical question answering with large language models. arXiv preprint
arXiv:2305.09617 (2023).
12. Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv
preprint arXiv:2311.16452 (2023).
13. Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Pmc-llama: Further finetuning llama on medical papers. arXiv preprint
arXiv:2304.14454 (2023).
14. Jin, D. et al. What disease does this patient have? a large-scale open domain question answering dataset from medical
exams. Appl. Sci. 11, 6421 (2021).
15. Li, Y. et al. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical
domain knowledge. arXiv preprint arXiv:2303.14070 (2023).
16. Han, T. et al. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint
arXiv:2304.08247 (2023).
17. Wang, H. et al. Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975 (2023).
18. Toma, A. et al. Clinical camel: An open-source expert-level medical language model with dialogue-based knowledge
encoding. arXiv preprint arXiv:2305.12031 (2023).
19. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. medicine 29, 1930–1940 (2023).
20. Patel, S. B. & Lam, K. Chatgpt: the future of discharge summaries? The Lancet Digit. Heal. 5, e107–e108 (2023).
21. Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: The potentials and
pitfalls. Annals Intern. Medicine (2024).
22. Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Medicine 3, 141 (2023).
23. Yang, X. et al. A large language model for electronic health records. NPJ Digit. Medicine 5, 194 (2022).
24. Abd-Alrazaq, A. et al. Large language models in medical education: Opportunities, challenges, and future directions.
JMIR Med. Educ. 9, e48291 (2023).
25. Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. Adv. neural information processing
systems 13 (2000).
26. Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J. & Khudanpur, S. Recurrent neural network based language model. In
Interspeech, vol. 2, 1045–1048 (2010).
27. Sundermeyer, M., Ney, H. & Schlüter, R. From feedforward to recurrent lstm neural networks for language modeling.
IEEE/ACM Transactions on Audio, Speech, Lang. Process. 23, 517–529 (2015).
28. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
29. Vaswani, A. et al. Attention is all you need. Adv. neural information processing systems 30 (2017).
30. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805 (2018).
31. Kaplan, J. et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
32. Hoffmann, J. et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022).
33. He, P., Liu, X., Gao, J. & Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint
arXiv:2006.03654 (2021).
34. Google. Bard: A generative artificial intelligence chatbot. https://gemini.google.com (2023).
35. Taori, R. et al. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca
(2023).
36. Yang, A. et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 (2023).
37. Chung, H. W. et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
38. Joseph, S. et al. Multilingual simplification of medical texts. arXiv preprint arXiv:2305.12532 (2023).
39. Liu, Y. et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
40. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).
41. Chiang, W.-L. et al. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality (2023).
42. Jiang, A. Q. et al. Mistral 7b. arXiv preprint arXiv:2310.06825 (2023).
43. Meta llama 3. https://github.com/meta-llama/llama3 (2024).
44. Bai, J. et al. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
45. Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst.
35, 27730–27744 (2022).
46. Claude. https://www.anthropic.com/claude (2024).
47. Lewis, M. et al. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and
comprehension. arXiv preprint arXiv:1910.13461 (2019).
48. Tay, Y. et al. Ul2: Unifying language learning paradigms. In International Conference on Learning Representations
(2022).
49. Lee, J. et al. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics
36, 1234–1240 (2020).
50. National Institutes of Health. PubMed Corpora (https://pubmed.ncbi.nlm.nih.gov/download/). In National Library of
Medicine (2022).
51. https://www.ncbi.nlm.nih.gov/pmc/.
52. Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions
on Comput. for Healthc. (HEALTH) 3, 1–23 (2021).
53. Beltagy, I., Lo, K. & Cohan, A. Scibert: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676
(2019).
54. Ammar, W. et al. Construction of the literature graph in semantic scholar. arXiv preprint arXiv:1805.02262 (2018).
55. Alsentzer, E. et al. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323 (2019).
56. Johnson, A. E. et al. Mimic-iii, a freely accessible critical care database. Sci. data 3, 1–9 (2016).
57. Alrowili, S. & Shanker, V. Large biomedical question answering models with albert and electra. In CLEF (Working
Notes), 213–220 (2021).
58. Gururangan, S. et al. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of Association
for Computational Linguistics (ACL) (2020).
59. Lo, K., Wang, L. L., Neumann, M., Kinney, R. & Weld, D. S. S2orc: The semantic scholar open research corpus. arXiv
preprint arXiv:1911.02782 (2019).
60. Yasunaga, M., Leskovec, J. & Liang, P. Linkbert: Pretraining language models with document links. In Proceedings of
Association for Computational Linguistics (ACL) (2022).
61. Phan, L. N. et al. Scifive: a text-to-text transformer model for biomedical literature. arXiv preprint arXiv:2106.03598
(2021).
62. Lu, Q., Dou, D. & Nguyen, T. Clinicalt5: A generative language model for clinical text. In Findings of the Association
for Computational Linguistics: EMNLP 2022, 5436–5443 (2022).
63. Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of bert and elmo
on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, 58–65 (2019).
64. Mutinda, F. W. et al. Detecting redundancy in electronic medical records using clinical bert. In Proceedings of the Annual
Conference of the Association for Natural Language Processing, 16–19 (2020).
65. Mahajan, D. et al. Identification of semantically similar sentences in clinical notes: Iterative intermediate training using
multi-task learning. JMIR medical informatics 8, e22508 (2020).
66. Jin, Q. et al. Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical
information retrieval. arXiv preprint arXiv:2307.00589 (2023).
67. Luo, R. et al. Biogpt: generative pre-trained transformer for biomedical text generation and mining. Briefings Bioinforma.
23, bbac409 (2022).
68. Venigalla, A., Frankle, J. & Carbin, M. Biomedlm: a domain-specific large language model for biomedical text. MosaicML (2022). Accessed: Dec. 23, 2022.
69. Gao, L. et al. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020).
70. Gao, W. et al. Ophglm: Training an ophthalmology large language-and-vision assistant based on instructions and dialogue.
arXiv preprint arXiv:2306.12174 (2023).
71. Chen, S. et al. Meddialog: a large-scale medical dialogue dataset. arXiv preprint arXiv:2004.03329 3 (2020).
72. Peng, C. et al. A study of generative large language model for medical research and healthcare. arXiv preprint
arXiv:2305.13523 (2023).
73. Xiong, H. et al. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv preprint arXiv:2304.01097
(2023).
74. Toyhom. Chinese medical dialogue data. https://github.com/Toyhom/Chinese-medical-dialogue-data (2023). GitHub
repository.
75. Chen, Y. et al. Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health
conversations polished by chatgpt. arXiv preprint arXiv:2310.15896 (2023).
76. Wang, G., Yang, G., Du, Z., Fan, L. & Li, X. Clinicalgpt: Large language models finetuned with diverse medical data and
comprehensive evaluation. arXiv preprint arXiv:2306.09968 (2023).
77. Ye, Q. et al. Qilin-med: Multi-stage knowledge injection advanced medical large language model. arXiv preprint
arXiv:2310.09089 (2023).
78. Healthcaremagic. https://www.healthcaremagic.com.
79. https://www.icliniq.com/.
80. Byambasuren, O. et al. Preliminary study on the construction of chinese medical knowledge graph. J. Chin. Inf. Process.
33, 1–9 (2019).
81. Zhang, H. et al. Huatuogpt, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075 (2023).
82. Xu, C., Guo, D., Duan, N. & McAuley, J. Baize: An open-source chat model with parameter-efficient tuning on self-chat
data. arXiv preprint arXiv:2304.01196 (2023).
83. Abacha, A. B. & Demner-Fushman, D. A question-entailment approach to question answering. BMC Bioinforma. 20
(2019).
84. Luo, Y. et al. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint
arXiv:2308.09442 (2023).
85. Zhang, X. et al. Alpacare: Instruction-tuned large language models for medical application. arXiv preprint
arXiv:2310.14558 (2023).
86. Yang, S. et al. Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback
and real-world multi-turn dialogue. arXiv preprint arXiv:2308.03549 (2023).
87. Shoham, O. B. & Rappoport, N. Cpllm: Clinical prediction with large language models. arXiv preprint arXiv:2309.11295
(2023).
88. Pollard, T. J. et al. The eicu collaborative research database, a freely available multi-center database for critical care
research. Sci. data 5, 1–13 (2018).
89. Johnson, A. et al. Mimic-iv. https://physionet.org/content/mimiciv/1.0/ (2020).
90. Ankit Pal, M. S. Openbiollms: Advancing open-source large language models for healthcare and life sciences. https://huggingface.co/aaditya/OpenBioLLM-Llama3-70B (2024).
91. Chen, Z. et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079
(2023).
92. Bosselut, A. et al. Meditron: Open medical foundation models adapted for clinical practice. Preprint (2024).
93. Sharegpt: Share your wildest chatgpt conversations with one click. https://sharegpt.com (2023).
94. Yang, L. et al. Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162 (2024).
95. Saab, K. et al. Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416 (2024).
96. Tanno, R. et al. Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report
generation. arXiv preprint arXiv:2311.18260 (2023).
97. Liévin, V., Hother, C. E. & Winther, O. Can large language models reason about medical questions? arXiv preprint
arXiv:2207.08143 (2022).
98. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35,
24824–24837 (2022).
99. Liu, Z. et al. Deid-gpt: Zero-shot medical text de-identification by gpt-4. arXiv preprint arXiv:2303.11032 (2023).
100. Wang, S., Zhao, Z., Ouyang, X., Wang, Q. & Shen, D. Chatcad: Interactive computer-aided diagnosis on medical image
using large language models. arXiv preprint arXiv:2302.07257 (2023).
101. Gao, Y. et al. Leveraging a medical knowledge graph into large language models for diagnosis prediction. arXiv e-prints
arXiv–2308 (2023).
102. Bodenreider, O. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research
32, D267–D270 (2004).
103. Shi, W. et al. Retrieval-augmented large language models for adolescent idiopathic scoliosis patients in shared decision-
making. In Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and
Health Informatics, 1–10 (2023).
104. SRS. https://www.srs.org. Accessed: 2024-05-14.
105. UpToDate. http://uptodate.com. Accessed: 2024-05-14.
106. Dynamed. https://www.dynamed.com. Accessed: 2024-05-14.
107. Kim, J. & Min, M. From rag to qa-rag: Integrating generative ai for pharmaceutical regulatory compliance process. arXiv
preprint arXiv:2402.01717 (2024).
108. Zakka, C. et al. Almanac—retrieval-augmented language models for clinical medicine. NEJM AI 1, AIoa2300068 (2024).
109. He, K. et al. A survey of large language models for healthcare: from data, technology, and applications to accountability
and ethics. arXiv preprint arXiv:2310.05694 (2023).
110. Zhang, S. et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792 (2023).
111. Wang, H., Liu, C., Zhao, S., Qin, B. & Liu, T. Chatglm-med. https://github.com/SCIR-HI/Med-ChatGLM (2023).
112. Hu, E. J. et al. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
113. Li, X. L. & Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190
(2021).
114. Liu, X. et al. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics, 61–68 (2022).
115. Liu, X. et al. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv
preprint arXiv:2110.07602 (2021).
116. Houlsby, N. et al. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning,
2790–2799 (2019).
117. Dong, Q. et al. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022).
118. Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the
2021 Conference on Empirical Methods in Natural Language Processing, 3045–3059 (2021).
119. Gao, Y. et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997
(2023).
120. Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. arXiv preprint
arXiv:2402.13178 (2024).
121. Li, X. & Li, J. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871 (2023).
122. Wang, G. et al. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291
(2023).
123. Chen, J. et al. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through
self-knowledge distillation. arXiv preprint arXiv:2309.07597 (2023).
124. Shao, Z. et al. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv
preprint arXiv:2305.15294 (2023).
125. Trivedi, H., Balasubramanian, N., Khot, T. & Sabharwal, A. Interleaving retrieval with chain-of-thought reasoning for
knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509 (2022).
126. Asai, A., Wu, Z., Wang, Y., Sil, A. & Hajishirzi, H. Self-rag: Learning to retrieve, generate, and critique through
self-reflection. arXiv preprint arXiv:2310.11511 (2023).
127. Donnelly, K. et al. Snomed-ct: The advanced terminology and coding system for ehealth. Stud. health technology
informatics 121, 279 (2006).
128. Organization, W. H. et al. International classification of diseases: [9th] ninth revision, basic tabulation list with alphabetic
index (World Health Organization, 1978).
129. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. Pubmedqa: A dataset for biomedical research question answering.
arXiv preprint arXiv:1909.06146 (2019).
130. Pal, A., Umapathi, L. K. & Sankarasubbu, M. Medmcqa: A large-scale multi-subject multi-choice dataset for medical
domain question answering. In Conference on Health, Inference, and Learning, 248–260 (2022).
131. Doğan, R. I., Leaman, R. & Lu, Z. Ncbi disease corpus: a resource for disease name recognition and concept normalization.
J. biomedical informatics 47, 1–10 (2014).
132. Tang, L. et al. Evaluating large language models on medical evidence summarization. npj Digit. Medicine 6, 158 (2023).
133. Van Veen, D. et al. Clinical text summarization: Adapting large language models can outperform human experts. arXiv
preprint arXiv:2309.07430 (2023).
134. Ondov, B., Attal, K. & Demner-Fushman, D. A survey of automated methods for biomedical text simplification. J. Am.
Med. Informatics Assoc. 29, 1976–1988 (2022).
135. Liu, F. et al. Retrieve, reason, and refine: Generating accurate and faithful patient instructions. Adv. Neural Inf. Process.
Syst. 35, 18864–18877 (2022).
136. Dong, H. et al. Automated clinical coding: what, why, and where we are? NPJ digital medicine 5, 159 (2022).
137. D’Onofrio, G. et al. Emotion recognizing by a robotic solution initiative. Sensors 22, 2861 (2022).
138. Biri, S. K. et al. Assessing the utilization of large language models in medical education: Insights from undergraduate
medical students. Cureus 15 (2023).
139. Vaidyam, A. N., Wisniewski, H., Halamka, J. D., Kashavan, M. S. & Torous, J. B. Chatbots and conversational agents in
mental health: a review of the psychiatric landscape. The Can. J. Psychiatry 64, 456–464 (2019).
140. McDuff, D. et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164
(2023).
141. Moor, M. et al. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), 353–367
(2023).
142. Lau, J. J., Gayen, S., Ben Abacha, A. & Demner-Fushman, D. A dataset of clinically generated visual questions and
answers about radiology images. Sci. data 5, 1–10 (2018).
143. He, X., Zhang, Y., Mou, L., Xing, E. & Xie, P. Pathvqa: 30000+ questions for medical visual question answering. arXiv
preprint arXiv:2003.10286 (2020).
144. Li, C. et al. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Adv. Neural Inf.
Process. Syst. 36 (2024).
145. Liu, B. et al. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021
IEEE 18th International Symposium on Biomedical Imaging (ISBI), 1650–1654 (IEEE, 2021).
146. Huang, C.-W., Tsai, S.-C. & Chen, Y.-N. Plm-icd: Automatic icd coding with pretrained language models. arXiv e-prints
arXiv–2207 (2022).
147. Saeed, M., Lieu, C., Raber, G. & Mark, R. G. Mimic ii: a massive temporal icu patient database to support research in
intelligent patient monitoring. In Computers in cardiology, 641–644 (IEEE, 2002).
148. Wang, H., Gao, C., Dantona, C., Hull, B. & Sun, J. Drg-llama: tuning llama model to predict diagnosis-related group for
hospitalized patients. npj Digit. Medicine 7, 16 (2024).
149. Johnson, A. E. et al. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.
Sci. data 6, 317 (2019).
150. Liu, J., Yang, S., Peng, T., Hu, X. & Zhu, Q. Chaticd: Prompt learning for few-shot icd coding through chatgpt. In 2023
IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 4360–4367 (2023).
151. Yang, Z., Batra, S. S., Stremmel, J. & Halperin, E. Surpassing gpt-4 medical coding with a two-stage approach. arXiv
preprint arXiv:2311.13735 (2023).
152. Ma, C. et al. An iterative optimizing framework for radiology report summarization with chatgpt. IEEE Transactions on
Artif. Intell. (2024).
153. Open-i. https://openi.nlm.nih.gov/. Accessed: 2024-05-14.
154. Van Veen, D. et al. Radadapt: Radiology report summarization via lightweight domain adaptation of large language
models. arXiv preprint arXiv:2305.01146 (2023).
155. Hyland, S. L. et al. Maira-1: A specialised large multimodal model for radiology report generation. arXiv preprint
arXiv:2311.13668 (2023).
156. Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Towards generalist foundation model for radiology. arXiv preprint
arXiv:2308.02463 (2023).
157. Moghani, M. et al. Sufia: Language-guided augmented dexterity for robotic surgical assistants. arXiv preprint
arXiv:2405.05226 (2024).
158. Yu, Q. et al. Orbit-surgical: An open-simulation framework for learning surgical augmented dexterity. arXiv preprint
arXiv:2404.16027 (2024).
159. Xu, H. et al. Enhancing surgical robots with embodied intelligence for autonomous ultrasound scanning. arXiv preprint
arXiv:2405.00461 (2024).
160. Killeen, B. D., Chaudhary, S., Osgood, G. & Unberath, M. Take a shot! natural language control of intelligent robotic
x-ray systems in surgery. Int. J. Comput. Assist. Radiol. Surg. 1–9 (2024).
161. García-Ferrero, I. et al. Medical mt5: an open-source multilingual text-to-text llm for the medical domain. arXiv preprint
arXiv:2404.07613 (2024).
162. Tiedemann, J. Parallel data, tools and interfaces in opus. In Lrec, vol. 2012, 2214–2218 (2012).
163. National Library of Medicine. Clinical trials. https://clinicaltrials.gov/ (2022). Accessed: 2024-05-14.
164. Wang, X. et al. Apollo: Lightweight multilingual medical llms towards democratizing medical ai to 6b people. arXiv
preprint arXiv:2403.03640 (2024).
165. Pieri, S. et al. Bimedix: Bilingual medical mixture of experts llm. arXiv preprint arXiv:2402.13253 (2024).
166. Tang, C., Wang, S., Goldsack, T. & Lin, C. Improving biomedical abstractive summarisation with knowledge aggregation
from citation papers. arXiv preprint arXiv:2310.15684 (2023).
167. Guo, Y., Qiu, W., Leroy, G., Wang, S. & Cohen, T. Retrieval augmentation of large language models for lay language
generation. J. Biomed. Informatics 149, 104580 (2024).
168. OpenAI. Chatgpt [large language model]. https://chat.openai.com (2023).
169. Qiu, H., Li, A., Ma, L. & Lan, Z. Psychat: A client-centric dialogue system for mental health support. arXiv preprint
arXiv:2312.04262 (2023).
170. Liu, J. M. et al. Chatcounselor: A large language models for mental health support. arXiv preprint arXiv:2309.15461
(2023).
171. Xu, X. et al. Mental-llm: Leveraging large language models for mental health prediction via online text data. Proc. ACM
on Interactive, Mobile, Wearable Ubiquitous Technol. 8, 1–32 (2024).
172. Turcan, E. & McKeown, K. Dreaddit: A reddit dataset for stress analysis in social media. arXiv preprint arXiv:1911.00133
(2019).
173. Naseem, U., Dunn, A. G., Kim, J. & Khushi, M. Early identification of depression severity levels on reddit using ordinal
classification. In Proceedings of the ACM Web Conference 2022, 2563–2572 (2022).
174. Haque, A., Reddi, V. & Giallanza, T. Deep learning for suicide and depression identification with unsupervised label
correction. In International Conference on Artificial Neural Networks, 436–447 (2021).
175. Gaur, M. et al. Knowledge-aware assessment of severity of suicide risk for early intervention. In The world wide web
conference, 514–525 (2019).
176. Sampath, K. & Durairaj, T. Data set creation and empirical analysis for detecting signs of depression from social media
postings. In International Conference on Computational Intelligence in Data Science, 136–151 (2022).
177. Jamil, Z. Monitoring tweets for depression to detect at-risk users. Ph.D. thesis, Université d’Ottawa/University of Ottawa
(2017).
178. Mauriello, M. L. et al. Sad: A stress annotated dataset for recognizing everyday stressors in sms-like conversational
systems. In Extended abstracts of the 2021 CHI conference on human factors in computing systems, 1–7 (2021).
179. Tu, T. et al. Towards conversational diagnostic ai. arXiv preprint arXiv:2401.05654 (2024).
180. Ren, Z., Zhan, Y., Yu, B., Ding, L. & Tao, D. Healthcare copilot: Eliciting the power of general llms for medical
consultation. arXiv preprint arXiv:2402.13408 (2024).
181. Sun, Z., Luo, C. & Huang, Z. Conversational disease diagnosis via external planner-controlled large language models.
arXiv preprint arXiv:2404.04292 (2024).
182. Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. Ai in health and medicine. Nat. medicine 28, 31–38 (2022).
183. Zhao, Z. et al. Clip in medical imaging: A comprehensive survey. arXiv preprint arXiv:2312.07353 (2023).
184. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
185. Liu, F., Wu, X., Ge, S., Fan, W. & Zou, Y. Exploring and distilling posterior and prior knowledge for radiology report
generation. In IEEE Conference on Computer Vision and Pattern Recognition (2021).
186. Ong, J. et al. Applying large language model artificial intelligence for retina international classification of diseases (icd)
coding. J. Med. Artif. Intell. 6 (2023).
187. Liu, X. et al. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nat.
Medicine 25, 1467–1469 (2019).
188. Ali, S. R., Dobbs, T. D., Hutchings, H. A. & Whitaker, I. S. Using chatgpt to write patient clinic letters. The Lancet Digit.
Heal. 5, e179–e181 (2023).
189. Wu, C. et al. Can gpt-4v (ision) serve medical applications? case studies on gpt-4v for multimodal medical diagnosis.
arXiv preprint arXiv:2310.09909 (2023).
190. Papineni, K., Roukos, S., Ward, T. & Zhu, W. BLEU: a Method for automatic evaluation of machine translation. In
Proceedings of Association for Computational Linguistics (ACL) (2002).
191. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Association for Computational
Linguistics (ACL) (2004).
192. Banerjee, S. & Lavie, A. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments.
In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or
summarization, 65–72 (2005).
193. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv
preprint arXiv:1904.09675 (2019).
194. Smit, A. et al. Chexbert: combining automatic labelers and expert annotations for accurate radiology report labeling
using bert. arXiv preprint arXiv:2004.09167 (2020).
195. Jain, S. et al. Radgraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463
(2021).
196. Yu, F. et al. Evaluating progress in automatic chest x-ray radiology report generation. Patterns 4 (2023).
197. Xie, Q. et al. Faithful ai in medicine: A systematic review with large language models and beyond. medRxiv (2023).
198. Ni, Z. et al. Grid: Scene-graph-based instruction-driven robotic task planning. arXiv preprint arXiv:2309.07726 (2023).
199. Wang, J. et al. Large language models for robotics: Opportunities, challenges, and perspectives. arXiv preprint
arXiv:2401.04334 (2024).
200. Pee, L. G., Pan, S. L. & Cui, L. Artificial intelligence in healthcare robots: A social informatics study of knowledge
embodiment. J. Assoc. for Inf. Sci. Technol. 70, 351–369 (2019).
201. Qiu, J. et al. Large ai models in health informatics: Applications, challenges, and the future. IEEE J. Biomed. Heal.
Informatics (2023).
202. Emaminejad, N., Akhavian, R. et al. Trust in construction ai-powered collaborative robots: A qualitative empirical
analysis. arXiv preprint arXiv:2308.14846 (2023).
203. Weerarathna, I. N., Raymond, D. & Luharia, A. Human-robot collaboration for healthcare: A narrative review. Cureus 15
(2023).
204. Moglia, A., Georgiou, K., Georgiou, E., Satava, R. M. & Cuschieri, A. A systematic review on artificial intelligence in
robot-assisted surgery. Int. J. Surg. 95, 106151 (2021).
205. Xia, Y., Wang, S. & Kan, Z. A nested u-structure for instrument segmentation in robotic surgery. In International
Conference on Advanced Robotics and Mechatronics (ICARM), 994–999 (2023).
206. Noll, R., Frischen, L. S., Boeker, M., Storf, H. & Schaaf, J. Machine translation of standardised medical terminology
using natural language processing: A scoping review. New Biotechnol. (2023).
207. Karabacak, M. et al. The advent of generative language models in medical education. JMIR Med. Educ. 9, e48163 (2023).
208. Ahn, S. The impending impacts of large language models on medical education. Korean J. Med. Educ. 35, 103 (2023).
209. Chen, Y., Arunasalam, A. & Celik, Z. B. Can large language models provide security & privacy advice? measuring the
ability of llms to refute misconceptions. In Proceedings of the 39th Annual Computer Security Applications Conference,
366–378 (2023).
210. Rawte, V., Sheth, A. & Das, A. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922
(2023).
211. Stock, A., Schlögl, S. & Groth, A. Tell me, what are you most afraid of? exploring the effects of agent representation on
information disclosure in human-chatbot interaction. arXiv e-prints arXiv–2307 (2023).
212. De Choudhury, M., Pendse, S. R. & Kumar, N. Benefits and harms of large language models in digital mental health.
arXiv preprint arXiv:2311.14693 (2023).
213. Hua, Y. et al. Large language models in mental health care: a scoping review (2024). 2401.02984.
214. Robinson, N., Connolly, J., Suddrey, G. & Kavanagh, D. J. A brief wellbeing training session delivered by a humanoid
social robot: A pilot randomized controlled trial. arXiv e-prints arXiv–2308 (2023).
215. Lai, T. et al. Psy-llm: Scaling up global mental health psychological services with ai-based large language models. arXiv
preprint arXiv:2307.11991 (2023).
216. Ma, Z., Mei, Y. & Su, Z. Understanding the benefits and challenges of using large language model-based conversational
agents for mental well-being support. In AMIA Annual Symposium Proceedings, vol. 2023, 1105 (2023).
217. Chung, N. C., Dyer, G. & Brocki, L. Challenges of large language models for mental health counseling. arXiv preprint
arXiv:2311.13857 (2023).
218. Wang, J., Yang, Z., Yao, Z. & Yu, H. Jmlr: Joint medical llm and retrieval training for enhancing reasoning and
professional question answering capability. arXiv preprint arXiv:2402.17887 (2024).
219. Stokel-Walker, C. Chatgpt listed as author on research papers: many scientists disapprove. Nature 613, 620–621 (2023).
220. Shen, X., Chen, Z., Backes, M., Shen, Y. & Zhang, Y. "Do anything now": Characterizing and evaluating in-the-wild
jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825 (2023).
221. Umapathi, L. K., Pal, A. & Sankarasubbu, M. Med-halt: Medical domain hallucination test for large language models.
arXiv preprint arXiv:2307.15343 (2023).
222. Roit, P. et al. Factually consistent summarization via reinforcement learning with textual entailment feedback. arXiv
preprint arXiv:2306.00186 (2023).
223. Chern, I.-C. et al. Improving factuality of abstractive summarization via contrastive reward learning. arXiv preprint
arXiv:2307.04507 (2023).
224. Manakul, P., Liusie, A. & Gales, M. J. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large
language models. arXiv preprint arXiv:2303.08896 (2023).
225. Shuster, K., Poff, S., Chen, M., Kiela, D. & Weston, J. Retrieval augmentation reduces hallucination in conversation.
arXiv preprint arXiv:2104.07567 (2021).
226. Dhuliawala, S. et al. Chain-of-verification reduces hallucination in large language models. arXiv preprint
arXiv:2309.11495 (2023).
227. Lin, S., Hilton, J. & Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint
arXiv:2109.07958 (2021).
228. Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y. & Wen, J.-R. Halueval: A large-scale hallucination evaluation benchmark for
large language models. arXiv e-prints arXiv–2305 (2023).
229. Liu, F. et al. Auto-encoding knowledge graph for unsupervised medical report generation. In Advances in Neural
Information Processing Systems (2021).
230. Shumailov, I. et al. Model dementia: Generated data makes models forget. arXiv preprint arXiv:2305.17493 (2023).
231. Hoelscher-Obermaier, J., Persson, J., Kran, E., Konstas, I. & Barez, F. Detecting edit failures in large language models:
An improved specificity benchmark. arXiv preprint arXiv:2305.17553 (2023).
232. Liu, F. et al. A medical multimodal large language model for future pandemics. npj Digit. Medicine 6, 226 (2023).
233. Yao, Y. et al. Editing large language models: Problems, methods, and opportunities. arXiv preprint arXiv:2305.13172
(2023).
234. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 33,
9459–9474 (2020).
235. Hendrycks, D. et al. Aligning ai with shared human values. arXiv preprint arXiv:2008.02275 (2020).
236. Glaese, A. et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375
(2022).
237. Nakano, R. et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332
(2021).
238. Liu, H., Sferrazza, C. & Abbeel, P. Chain of hindsight aligns language models with feedback. arXiv preprint
arXiv:2302.02676 3 (2023).
239. Sallam, M. Chatgpt utility in healthcare education, research, and practice: systematic review on the promising perspectives
and valid concerns. In Healthcare, 887 (MDPI, 2023).
240. Tian, S. et al. Opportunities and challenges for chatgpt and large language models in biomedicine and health. Briefings
Bioinforma. 25, bbad493 (2024).
241. Li, H., Guo, D., Fan, W., Xu, M. & Song, Y. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint
arXiv:2304.05197 (2023).
242. Wei, A., Haghtalab, N. & Steinhardt, J. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483
(2023).
243. Meskó, B. & Topol, E. J. The imperative for regulatory oversight of large language models (or generative ai) in healthcare.
NPJ digital medicine 6, 120 (2023).
244. Derraz, B. et al. New regulatory thinking is needed for ai-based personalised drug and cell therapies in precision oncology.
NPJ Precis. Oncol. 8, 23 (2024).
245. Hacker, P., Engel, A. & Mauer, M. Regulating chatgpt and other large generative ai models. In Proceedings of the 2023
ACM Conference on Fairness, Accountability, and Transparency, 1112–1123 (2023).
246. Mökander, J., Schuett, J., Kirk, H. R. & Floridi, L. Auditing large language models: a three-layered approach. AI Ethics
1–31 (2023).
247. Chen, Q. et al. An extensive benchmark study on biomedical text generation and mining with chatgpt. Bioinformatics 39,
btad557 (2023).
248. Chen, Q. et al. Large language models in biomedical natural language processing: benchmarks, baselines, and recommen-
dations. arXiv preprint arXiv:2305.16326 (2023).
249. Yin, S. et al. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023).
250. Tu, T. et al. Towards generalist biomedical ai. arXiv preprint arXiv:2307.14334 (2023).
251. Li, C. et al. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint
arXiv:2306.00890 (2023).
252. Shu, C., Liu, F. & Shareghi, C. Visual med-alpaca: A parameter-efficient biomedical llm with visual capabilities.
https://github.com/cambridgeltl/visual-med-alpaca (2023).
253. Moor, M. et al. Med-flamingo: a multimodal medical few-shot learner. arXiv preprint arXiv:2307.15189 (2023).
254. Liu, J. et al. Qilin-med-vl: Towards chinese large vision-language model for general healthcare. arXiv preprint
arXiv:2310.17956 (2023).
255. Huang, H. et al. Chatgpt for shaping the future of dentistry: the potential of multi-modal large language model. Int. J.
Oral Sci. 15, 29 (2023).
256. Li, J., Liu, C., Cheng, S., Arcucci, R. & Hong, S. Frozen language model helps ecg zero-shot learning. arXiv preprint
arXiv:2303.12311 (2023).
257. Englhardt, Z. et al. Exploring and characterizing large language models for embedded system development and debugging.
arXiv preprint arXiv:2307.03817 (2023).
258. Xi, Z. et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864
(2023).
259. Wang, L. et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432 (2023).
260. Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D. & Ghanem, B. Camel: Communicative agents for "mind"
exploration of large scale language model society. arXiv preprint arXiv:2303.17760 (2023).
261. Tang, X. et al. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint
arXiv:2311.10537 (2023).
262. Organization, W. H. Physical activity (2022). Accessed: Aug. 18, 2023.
263. Connor, M. & O’Neill, M. Large language models in sport science & medicine: Opportunities, risks and considerations.
arXiv preprint arXiv:2305.03851 (2023).
264. Mello, M. M. & Guha, N. Chatgpt and physicians’ malpractice risk. In JAMA Health Forum, e231938–e231938 (2023).
Acknowledgements
This work was supported in part by the Pandemic Sciences Institute at the University of Oxford; the National Institute for
Health Research (NIHR) Oxford Biomedical Research Centre (BRC); an NIHR Research Professorship; a Royal Academy
of Engineering Research Chair; the Wellcome Trust funded VITAL project; UK Research and Innovation (UKRI); the
Engineering and Physical Sciences Research Council (EPSRC); and the InnoHK Hong Kong Centre for Cerebro-cardiovascular
Engineering (COCHE).
Author Contributions
FL, ZL, JL, and DC conceived the project. FL conceived and designed the study. HZ, FL, BG, XZ, and JH conducted the
literature review, performed data analysis, and drafted the manuscript. All authors contributed to the interpretation and final
manuscript preparation. All authors read and approved the final manuscript.
Competing Interests
The authors declare no competing interests.