Transformer-based active learning for multi-class text annotation and classification
DIGITAL HEALTH
Volume 10: 1–21
© The Author(s) 2024
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/20552076241287357
journals.sagepub.com/home/dhj
Abstract
Objective: Data-driven methodologies in healthcare necessitate labeled data for effective decision-making. However, medical data, particularly in unstructured formats, such as clinical notes, often lack explicit labels, making manual annotation challenging and tedious.
Methods: This paper introduces a novel deep active learning framework designed to facilitate the annotation process for multiclass text classification, specifically using the SOAP (subjective, objective, assessment, plan) framework, a widely recognized medical protocol. Our methodology leverages transformer-based deep learning techniques to automatically annotate clinical notes, significantly easing the manual labor involved and enhancing classification performance. Transformer-based deep learning models, with their ability to capture complex patterns in large datasets, represent a cutting-edge approach for advancing natural language processing tasks.
Results: We validate our approach through experiments on a diverse set of clinical notes from publicly available datasets, comprising 426 documents. Our model not only demonstrates superior classification accuracy, with an F1 score improvement of 4.8% over existing methods, but also provides a practical tool for healthcare professionals, potentially improving clinical documentation practices and patient care.
Conclusions: The research underscores the synergy between active learning and advanced deep learning, paving the way for future exploration of automatic text annotation and its implications for clinical informatics. Future studies will aim to integrate multimodal data and large language models to enhance the richness and accuracy of clinical text analysis, opening new pathways for comprehensive healthcare insights.
Keywords
Text classification, text annotation, active learning, transfer learning, deep learning, BERT, clinical text, SOAP
Submission date: 13 July 2023; Acceptance date: 10 September 2024
1 College of Computing, Birmingham City University, Birmingham, UK
2 Department of AI and Data Science, Sejong University, Seoul, Korea
3 Department of Computer Science, St John’s University, Jamaica, NY, USA
4 School of Computer Science, University of Birmingham, Birmingham, UK
5 School of Computing and Engineering, University of Derby, Derby, UK
6 Department of Software, Sejong University, Seoul, Korea
7 Department of Computer Science and Engineering, Kyung Hee University, Yongin, Korea
* Current affiliation: Department of Artificial Intelligence, Ajou University, Suwon-Si, South Korea.
† These authors contributed equally to this work.
Corresponding author:
Sungyoung Lee, Department of Computer Science and Engineering, Kyung Hee University, Yongin 17104, Korea.
Email: [email protected]
Creative Commons NonCommercial-NoDerivs CC BY-NC-ND: This article is distributed under the terms of the Creative Commons Attribution-NoDerivs 4.0 License (https://creativecommons.org/licenses/by-nc-nd/4.0/) which permits any use, reproduction and distribution of the work as published without adaptation or alteration, provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
Introduction
In today’s world, patient data are logged into an electronic health record (EHR) system in both structured and unstructured formats.1 The unstructured form mainly includes clinical notes, discharge summaries, and diagnostic test reports written in natural language. These reports contain vital information that might help solve clinical questions about patient health conditions, clinical reasoning, and inferencing. However, due to the time limitation, physicians have difficulty examining the unstructured information at the point of care.2 Traditionally, clinically relevant information from clinical documents is extracted through manual methods with the support of clinical domain experts,
which creates hurdles in terms of scalability and costs. At the same time, data availability allows researchers to execute automated algorithms extracting helpful information for efficient disease care.3 Natural language processing (NLP) plays a significant role in the clinical domain for various applications, such as medical concept identification in different clinical documents.4 Recently, NLP applications have further diversified to include disease outbreak detection, conversion of free text to structured features for decision support, answering clinical questions, and accessing knowledge embodied in free-text clinical and biomedical resources.5
The information extraction facilitated by NLP led to automated clinical text classification in clinical predictive analytics, which emerged with the massive creation of clinical notes and the rapidly growing adoption of EHR systems.6 Two types of techniques, symbolic and statistical machine learning, are commonly used for clinical text classification tasks.7 Symbolic techniques are used in applications that involve rules hand-crafted by domain experts, such as logic rules and regular expressions. Although rule-based methods are effective in the clinical domain because of sublanguage properties, it can be laborious to develop a system that requires collaboration between technical NLP experts and clinical domain experts. Moreover, the final applications may have limited portability and generalization beyond the scenario for which they were intended.8
Machine learning (ML) methods have proven efficient for clinical text classification. However, an effective supervised ML model still needs human involvement to annotate a large set of training data. The effort required from domain experts to label unstructured data is a significant obstacle to efficient data analysis.9 The annotation problem is of primary concern in the medical domain because of the lack of publicly available clinical data and of the expert knowledge needed for accurate annotations. Other popular methods, such as crowdsourcing, are unsuitable for creating labeled clinical training data because of the sensitive nature of the domain. Also, the findings of a systematic review9 show that most datasets used in training ML models for text classification consist of mere hundreds or thousands of records because of the annotation bottleneck.
Modern orthogonal approaches such as active learning (AL) and transfer learning (TL), which are utilized as machine-assisted pre-annotation methods, have been applied to resolve the issues of the manual annotation process.10 AL provides a subset of high-value training samples, reducing the large volume of data required for labor-intensive annotation without losing quality.11 The initial data for the AL process can be prepared through symbolic techniques, such as a rule-based approach combined with a domain- or task-specific lexicon or dictionary like UMLS12 and Bioportal.13 The selection of samples is iterative, starting with a high-quality manually annotated subset of samples and moving to automatically generate another subset of annotations, thus increasing the subset of annotated text for use in the subsequent iterations of the process.10 AL approaches have been applied in the clinical domain to decrease the labor-intensive data annotation burden and enhance model classification performance with few labeled examples.11–14
In recent times, we have seen a growing amount of biomedical data available in textual form. Substantial advances in the development of pretrained language representation models provide an opportunity for a range of biomedical domain tasks, such as pretrained word embedding, sentence embedding, and contextual representations. According to Beltagy et al.,15 SciBERT outperforms the baseline bidirectional encoder representations from transformers (BERT) model on biomedical tasks. SciBERT is a deep learning-based language model that uses the original BERT model and is trained on scientific articles for the biomedical domain.
Given the inherent difficulties in clinical text annotation and classification, this work employs a mixed-method design that combines experimental and observational research components. Our methodology starts by creating a rule-based system to produce a seed dataset, which is subsequently utilized to initialize an AL model based on transformer mechanisms. Implementing this iterative procedure not only reduces the amount of annotation required but also improves the learning efficiency of the model. Through the utilization of AL and TL approaches, our methodology deliberately chooses and defines data points that optimize the performance of the model. The results indicate a substantial enhancement in the accuracy of categorizing clinical notes. This study proposes a methodology for clinical text annotation and classification by combining AL and TL approaches to minimize human effort in creating labeled data. We focus on the primary challenge in supervised ML, data annotation, and on how AL can alleviate it. Specifically, manually annotating data poses a significant challenge due to its time-consuming and labor-intensive nature, often leading to bottlenecks in large-scale NLP projects. In contrast, AL offers a strategic solution to this problem by selectively querying unlabeled data that, once annotated, greatly benefit the model’s learning process. This approach not only streamlines the annotation process but also boosts the model’s performance with a potentially smaller, yet richer, dataset. Tackling these challenges directly enables a comprehensive understanding of the trade-off between the labor costs of annotation and the efficiency improvements provided by AL, establishing a solid foundation for further investigation of our research contributions.
The proposed methodology employs a rule-based NLP algorithm based on a lexical approach that automatically annotates the unlabeled input data to create an initial seed dataset. Using the initially labeled dataset, we design an AL approach by training transformer-based deep learning
Afzal et al. 3
to enhance the initial seed data. The AL output, that is, the enhanced annotated data, is used to train the proposed SciBERT-based multiclass classification model to classify texts in the clinical documents into the four classes of the SOAP (subjective, objective, assessment, plan) protocol.5 SOAP is a well-known structure in which patient information is organized into four logical compartments.
To demonstrate the usefulness of the proposed methodology, we conducted a set of experiments on clinical notes acquired from a public dataset (i2b2/VA 2010).16 The findings of the proposed approach indicate a significant reduction in annotation costs while achieving higher accuracy than the existing approaches used for the same task in the past. Furthermore, our approach is unique in applying a novel AL methodology enhanced with TL for embedding to perform text classification tasks using an attention-based deep learning model. This approach differs from traditional NLP approaches in how it captures context within SOAP sections. For instance, a medication “xyz” may appear in a clinical note in two different forms: “xyz” is used currently, and “xyz” is prescribed for future use. Here, identifying medication names correctly is not sufficient; the context is important too. Identifying SOAP sections differentiates between the “xyz” medication as currently in use (subjective) and prescribed for the future (plan).
Our proposed approach provides an end-to-end solution involving clinical text preprocessing, a rule-based model for initial data annotations, a deep AL-based model for enhanced data annotations, and multiclass classification model development, validation, and testing. The proposed methods are useful not only for clinical text classification but also for other NLP tasks and applications, such as question-answering systems, clinical decision support systems, clinical follow-up systems, and health technology assessment processes. Generally, the automatic clinical annotations and labeling created with our proposed models are helpful for any clinical text classification or prediction task that needs labeled data. In summary, the key contributions of this study are as follows:
• Developing syntactic and semantic algorithms for unstructured clinical text preprocessing and section identification to prepare initial training data with SOAP labels as seed data for the AL model
• Developing a robust transformer-based AL model with uncertainty-based sampling (the least confidence query strategy) for annotating unlabeled clinical data with SOAP labels
• Developing a dual attention network model, which employs two inputs: (a) SciBERT-based transfer learning (TL) input for capturing contextual information and (b) UMLS-based semantic enrichment (UMLS-SE) input to help capture semantic information
Literature review
There are two types of experimental settings available in the clinical domain: shared-task settings and clinical practice settings. In shared-task settings, challenging NLP16 corpora are typically made accessible with well-defined evaluation methods and public availability; hence, they are commonly recognized as benchmarks. However, in a clinical practice setting, the EHR is used directly for concept extraction in real-world contexts, such as internal medicine and orthopedics.
In a shared-task setting, it is more difficult, expensive, and time-consuming to construct an ML-based concept extraction application as there is insufficient annotated data. In a clinical practice context, clinical information extraction and classification tasks are performed using symbolic and statistical ML, as stated in the introduction. The current AL technique has resolved the problem of automated data annotation. To employ AL strategies, initial data is created using a symbolic method and domain-specific terminology. In one study, MedCATTrainer,11 a web-based interface for extracting medical concepts from EHR free text, was developed. The authors obtained the initial semantic annotation from UMLS,12 an open-source biomedical ontology repository, as well as rule patterns for concept identification, and afterward stored the annotated data in the database.
The interface enables a user to semantically edit annotated concepts or contribute semantic annotation to a missing concept, which they refer to as the AL technique. After many annotated concepts have been collected, an ML model such as a random forest is employed.
However, the authors did not present early findings in the paper, and domain expertise is required to effectively run this application and annotate medical concepts. Word embedding similarity is another technique that plays a key part in the AL process. A pretrained model is used to create the embeddings of labeled and unlabeled data. The embedding similarity between labeled and unlabeled data is then assessed. Unlabeled data whose similarity falls within a certain threshold are assigned to the category of the matching labeled data.
In their research, Hussain et al.17 have suggested a unique approach for identifying causal relationships in clinical text. Initial data are created using a symbolic approach, and a Google News word2vec pretrained18 model is used for semantic expansion. Using BERT, the extended causal terms are afterward turned into embedded vectors. These embedded vectors are then used to calculate a cosine similarity matching score against causal words contained in two additional datasets. Finally, the domain expert verifies the predicted words from different datasets, concluding the AL process.
Ning An et al.19 used word embedding with cosine similarity to detect causal relationships as a four-class classification problem. One-hot encoding converts causal verbs in the
seed list and verbs in NP-VP-NP ternaries into encoding vectors. Based on a Wikipedia dataset, these vectors are translated using continuous Skip-Gram. The encoded vectors are compared using cosine similarity, and the pair with the highest similarity above 0.5 is used to classify the causal relationship and update the seed list. This technique earned an F-score of 78.67%, a substantial improvement over earlier causal link detection efforts.
Li et al.20 have used AL to reduce annotation requirements in the deidentification workflow by incorporating real clinical trials and i2b2 datasets to show the improved performance of trained models compared to the traditional passive learning framework.
Similarly, Tomanek and Hahn21 examined the impact of AL in decreasing the time required for data annotation for entity (person, organization, and location) extraction. They noticed that the AL process decreases data annotation time and cost by up to 33% compared to baseline. Chen22 conducted a simulation experiment to reannotate a subset of the i2b2/VA 2010 dataset from the concept extraction challenge. Their results showed that the AL-based query strategy reduced the volume of data needed for manual annotation compared to baseline.
AL is used in other domains such as sentiment analysis,23 where the authors proposed a novel active deep network (ADN) to solve the problem of the small dataset in the sentiment classification problem. In another study, Hajmohammadi et al.24 used AL and self-training for cross-lingual sentiment classification and compared their proposed model against other baseline models; they found that AL performed better than the baseline models (without AL).
In addition to AL, researchers have used TL to learn knowledge from previously learned domains and apply it to newer domains and tasks. Most real-world applications suffer from data deficiency that results in suboptimal models based on deep learning approaches. TL is touted to address this issue by allowing pretrained models from domain A to be applied to tasks in another domain B, where A and B are related domains. TL is the dominant approach leveraged by leading language models such as RNNs, LSTMs, and transformer-based language (TBL) models. These models can be used for any downstream task, language, or domain. The TBL models perform better on various NLP tasks as compared with other models. In modern NLP techniques, researchers combine TL methods with large-scale TBL models to achieve better performance. The existing language models based on RNNs and LSTMs suffer from the vanishing gradient problem and cannot handle longer contextual dependencies.
LSTM-based models, such as ELMO (embeddings from language model) or ULMFiT (universal language model fine-tuning), are still used for modern NLP tasks. However, the main limitation of LSTM-based models is that they are challenging to train in parallel.
The word representation of ELMO contains richer information compared to a standard or traditional word embedding such as Skip-Gram25 and global vectors for word representation (GloVe).26 Although the ELMO model is shown to have good performance in some named entity recognition (NER) tasks, such as the CoNLL 2003 NER task, it is trained in a general domain and, as a result, does not demonstrate the desired performance for a clinical concept extraction task.
The transformer architecture resolves these issues with an attention mechanism, which takes in the entire sequence from the whole document and trains the model in a parallel fashion. Various TBL models with slight differences exist for modern NLP tasks, but the performance of BERT-based models is exceptional.27
BioBERT28 and ClinicalBERT29 are recent examples of domain adaptations of BERT. BioBERT is trained on PubMed abstracts and PMC full-text publications, while ClinicalBERT is trained on MIMIC-III clinical text.29 SciBERT15 is trained on the complete text of 1.14 million biomedical and computer science publications from the Semantic Scholar corpus to increase performance on subsequent scientific NLP tasks. SciBERT is assessed on five fundamental NLP tasks: NER; extraction of participants, interventions, comparisons, and outcomes (PICO) from clinical trial publications;30 text classification; relation classification; and dependency parsing. We used SciBERT for SOAP label classification, where SOAP is a clinical protocol that organizes patient information into four logical compartments, because SciBERT has already been evaluated on PICO, a clinical protocol used to frame clinical questions in terms of problem, intervention, comparison, and outcome.
In summary, the literature review highlights the intersection of deep learning and NLP as a frontier of innovation in the clinical domain, demonstrating the potential for significant advancements in healthcare delivery and patient care. The evolution from traditional NLP techniques to the adoption of advanced methodologies like AL, TL, and the integration of domain-specific transformer-based models signifies a transformative shift towards more accurate, efficient, and nuanced processing of clinical data.
Methodology
This section describes the proposed framework for SOAP-based data labeling and classification of clinical text. The framework is divided into three steps, as shown in Figure 1. In the first step, a rule-based algorithm (“SOAPNotesParser”) is employed for initial data labeling (seed data annotations). The rule-based algorithm includes both syntactic and semantic approaches to annotate different sections in the clinical notes according to the SOAP protocol. In the second step, an AL model is designed to create more data with SOAP labels as a training dataset for the classification model. Finally, a pretrained model is used to create embeddings to enrich the training data for attaining
data and gaining maximum throughput out of the final deep learning model, which we eventually utilize to classify the unseen clinical notes.
Figure 1. SOAP-based data labeling and classification framework of unstructured clinical notes.
Dataset
This study was conducted using a dataset composed of unstructured clinical discharge summaries collected from three key sources: the i2b2 National Center, Partners Healthcare, and Beth Israel Deaconess Medical Center, as shown in Table 1. The dataset’s comprehensive breakdown is as follows, highlighting the number of clinical notes and the distribution of labeled versus unlabeled data. Partners Healthcare contributes 97 clinical notes, Beth Israel Deaconess Medical Center contributes 73 clinical notes, and the dataset provided by the i2b2 National Center for System Evaluation contains 256 clinical notes. Cumulatively, we utilized 426 unstructured clinical discharge summaries in the proposed methodology. These clinical notes consist of explicitly defined sections used for section-based SOAP annotation.
Table 1. Dataset sources along with the number of clinical notes.
Dataset source                          Clinical notes
Partners Healthcare                     97
Beth Israel Deaconess Medical Center    73
i2b2 National Center                    256
Total                                   426
The clinical notes encompass a variety of explicit and implicit sections, meticulously annotated to align with the SOAP framework. This approach ensures a structured analysis and classification of the clinical text.
Initial label dataset for the active learning process: Annotation and preparation
We developed and implemented an algorithm (“SOAPNotesParser”) to efficiently parse and label clinical notes according to the SOAP framework for the AL process, as shown in Figure 2. In this step, we selected 20 clinical notes having explicit header sections. This process is designed to transform unstructured clinical text into organized data, facilitating the AL process.
This initial dataset was intentionally diverse, spanning various document types, medical specialties, and patient demographics to ensure broad representation. The selected clinical notes were those annotated with high confidence by the “SOAPNotesParser” and confirmed with input from domain experts, with attention to including both common and rare conditions and to accounting for evolving medical practices.
The algorithm consists of the following steps.
1. Identifying the SOAP sections. The initial step involves scanning the clinical note for indicators of the main SOAP sections: subjective, objective, assessment, and plan. These sections are integral to the structure of clinical documentation, each serving a distinct role in encapsulating different aspects of patient care. By recognizing the keywords or phrases that typically denote the beginning of each section, the algorithm effectively demarcates the boundaries of these categories within the text.
2. Accumulating text under each section. Once a section header is identified, the algorithm accumulates text corresponding to that section. It captures the content line by line, aggregating it until a new section header is encountered. This ensures that all information pertinent to a particular aspect of the SOAP framework is grouped, maintaining the integrity and context of the original clinical note. Importantly, the algorithm skips the header line itself to avoid redundancy, focusing instead on the substantive content that follows.
3. Assigning labels to text. As the algorithm aggregates text under each SOAP section, it also assigns appropriate labels to this content, indicating whether it pertains to subjective, objective, assessment, or plan aspects of patient care. This labeling is crucial for downstream applications, providing a clear, structured framework for analyzing the note’s content. The process distinguishes between different types of clinical information, from patient-reported symptoms to treatment plans, enhancing the utility of the extracted data.
4. Handling subheadings and complex structures. The algorithm is adept at navigating the complexities of clinical documentation, including various subheadings and nuanced formatting that may occur within each main SOAP section. By employing a flexible parsing strategy, it can accommodate diverse document structures, ensuring comprehensive and accurate categorization of the text. This capability is particularly important given the wide range of documentation styles and conventions used across different healthcare settings. Parts of this step build on our previous work,31 and the section header terminology lexicon (SHTL) is based on prior work.32
5. Producing structured output. The culmination of the parsing and labeling process is the generation of structured output. The algorithm converts the categorized text into a format that is amenable to further analysis, such as a list of dictionaries. Each entry in the output indicates the section to which the text belongs, along with the labeled content itself.
Preprocessing and auto-labeling using the active learning process
The initial dataset obtained from the 20 clinical notes produced around 243 labeled instances as the training dataset for the AL process. To label the remaining clinical notes (406 in total), we first applied preprocessing, in which clinical notes were segmented into individual sentences. This segmentation was executed based on specific rules: a newline character (“\n”) or the occurrence of a period followed by a space and an uppercase letter, indicative of the start of a new sentence. Following sentence segmentation,
two noteworthy observations were made: the prevalence of short sentences (those with fewer than a prespecified number of words, here five) and duplicate sentences. These characteristics can be attributed to the uniformity in documenting physical and medical examinations, along with the concise way medical records are often completed by healthcare professionals.
To obtain a cleaner dataset, we filtered out short sentences and duplicates during preprocessing. We used the Natural Language Toolkit (NLTK) to tokenize each sentence into words and excluded sentences with fewer than five words, since little useful information can be drawn from such short sentences. Setting this length threshold ensured that the remaining sentences were informative. We also removed duplicate sentences so that only unique content remained. This was critical to improving the performance of AL in clinical note classification, as well-curated examples improved relevance and accuracy. For this study, we used an AL approach based on the small-text framework to choose the most informative unlabeled data from the pool.33
To further elaborate on the AL process, once the initial classifier is trained using the small seed dataset, it employs the small-text framework33 to execute pool-based sampling with the least confidence query strategy, as shown in Figure 3. This technique involves presenting the model with unlabeled data and asking it to predict labels for these instances. Those instances with the lowest prediction confidence are deemed the most valuable for learning because they represent the boundary cases about which the model is most uncertain.
The selected instances are then reviewed by an oracle, that is, a human expert or an automated system capable of providing the correct labels. This step is crucial as it ensures that the model is trained on accurately labeled data, thereby enhancing its learning efficiency. Once the oracle annotates the chosen instances with the correct labels, these newly labeled examples are added to the training dataset, and the model is retrained. This iterative cycle of prediction, selection by least confidence, and retraining with newly labeled data continues until the model reaches a state of convergence. Convergence is defined as the point at which additional training on new data does not significantly improve the model’s performance, indicating that the model has achieved its maximum learning potential given the available data.
During this iterative process, the use of SciBERT,15 a pretrained transformer-based model specifically tailored for scientific text, plays a pivotal role. The choice of SciBERT, with its uncased variant, allows for the construction of high-quality embedding vectors from the small labeled dataset. These embeddings capture the semantic nuances of the scientific domain, enabling more effective model training than would be possible with general-purpose language models.
The AL methodology detailed in this study underscores the efficiency of using a targeted approach to data annotation. By focusing on instances where the model’s certainty is lowest, the AL strategy ensures that the model’s training is both efficient and effective, reducing the need for a vast amount of labeled data.25 This approach is particularly beneficial in domains where labeled data is scarce or expensive to obtain, such as specialized scientific fields.
Moreover, the adoption of a pool-based sampling strategy, as opposed to stream- or membership-based selection, is motivated by the practical considerations of having a relatively small labeled dataset and a substantially larger pool of unlabeled data.34 The pool-based approach allows for a more systematic exploration of the data space, ensuring that the model encounters a diverse set of examples during its training. This diversity is critical for developing a robust model capable of generalizing well to new, unseen data.
In conclusion, the AL approach described here leverages the strengths of the small-text framework, SciBERT embeddings, and a judicious selection strategy to efficiently tackle the challenge of text annotation in a data-scarce environment. The methodology’s emphasis on targeting model uncertainty and iteratively refining the training dataset through expert annotation leads to significantly improved model performance. This approach not only accelerates the process of model development but also enhances the model’s accuracy and generalizability, making it a valuable
Figure 3. A step-by-step process of automatic text annotation using an active learning approach.
strategy for advancing ML applications in specialized domains. By the end of this process, we successfully labeled the 3146 instances as a training dataset.

SOAP-BioMedBERT: The proposed model
With the AL model, we add enough labeled records to the dataset, which is sufficient to use as a training dataset for a state-of-the-art deep learning model. Furthermore, we developed an attention-based deep learning model named “SOAP-BioMedBERT.” A high-level workflow architecture of the proposed model for classifying clinical notes with SOAP labels is depicted in Figure 4.
The proposed model utilizes TL and UMLS-based semantic enrichment (UMLS-SE) to achieve optimal results. The combination of the two networks was intended to help capture both contextual and semantic information in clinical notes.
In the model, the weight-tuning operation is activated along with the SOAP-based training dataset to learn specific characteristics of the data. Firstly, the clinical text is normalized using data preprocessing techniques such as removing accented characters, expanding contractions, removing special characters, stemming, and removing stop words. Then, the normalized clinical text is inputted into two proposed networks for predicting the final SOAP label. Both networks combine concatenation, dropout, and dense layers using the SoftMax activation function. The cross-entropy loss is optimized using Adam, with a dropout of 0.3.

Contextual information network. Our network is meticulously architected to capture the intricate context of clinical text, employing three distinct layers: word embedding, encoding, and attention layer. We utilize the pretrained SciBERT-based uncased model,15 which operates on a BERT-based architecture with 24 layers, at the word embedding stage. This transformer-based model, initially representing words in their embedded form, employs multiheaded attention across each layer to iteratively refine word representations, informed by the surrounding textual context as shown in Figure 4(a).
The BERT architecture is adept at capturing bidirectional contextual cues. However, the clinical domain’s nuanced linguistic structure demands enhanced processing capabilities. Hence, we enrich the SciBERT embeddings with a bi-directional long short-term memory (Bi-LSTM) network. This addition strategically augments the model’s capacity to discern long-range dependencies and complex patterns, typical of clinical narratives.
Bi-LSTMs offer a significant advantage due to their dual-directional processing, which captures information from both past and future contexts within a sequence. This property is especially advantageous for clinical texts, which often hinge on the temporal sequence of events and interdependencies of medical terms. By integrating the Bi-LSTM layer, we achieve a more profound contextual understanding, yielding more precise and reliable classifications.
Following the Bi-LSTM, the attention layer acts as a precision tool, spotlighting the salient words pivotal for accurate classification. It addresses the potential information dilution in Bi-LSTM by applying a weighted sum to the encoded states, thus preserving valuable information. The attention weights, derived from a small dedicated neural network atop each encoded state, culminate in a single-unit output that denotes the attention weight, further refined by dense layers and a tanh activation function inspired by Bahdanau Attention.35
The implementation of this comprehensive encoding and attention strategy was a deliberate choice, balancing the computational overhead against the substantial gains in contextual interpretation it offers. This calculated decision underscores our dedication to innovating while remaining sensitive to the nuanced requirements of clinical text analysis.

Semantic information network. A semantic information network as shown in Figure 4(b) is used to capture domain-specific semantic information. For extracting the medical entities and their concepts from the given text, a component of the scispaCy36 NER model is utilized, and the UMLS is used as the knowledge base for the entity linker in the scispaCy component. It returns a concept unique identifier (CUI), name, definition, type unique identifier (TUI), and aliases. Embeddings are generated from the extracted UMLS semantic information for the inputted sentence, followed by Bi-LSTM and attention layers as in the contextual information network. In order to provide a comprehensive semantic representation of clinical concepts, several fields are required.
Embeddings are created using the extracted UMLS semantic information. These embeddings include all the fields mentioned before: the CUI, which distinctly identifies each concept; the name, which serves as a standard reference; the definition, which provides contextual understanding; the TUI, which classifies the concept within larger medical hierarchies; and aliases, which capture different synonymous expressions of the concept. Through the utilization of these multiple fields, the embeddings effectively capture both the clear identification and contextual connections of medical terminology, which are essential for precisely understanding the subtleties in clinical writing.
To enhance their representation, these embeddings are further put through a Bi-LSTM layer and an attention layer, which are analogous to those of the contextual information network. The incorporation of the UMLS fields guarantees that the model not only identifies particular medical items but also comprehends their wider semantic and contextual implications, which is crucial for the precise and dependable categorization of clinical narratives.

Performance evaluation metrics
To measure the merit of the algorithms, we use four statistical indicators (recall, precision, F1-score, and accuracy) for the evaluation, and the computing formulas of these
metrics are given in equation (2).
\[
\begin{aligned}
\mathrm{Recall} &= \frac{TP}{TP + FN} \\
\mathrm{Precision} &= \frac{TP}{TP + FP} \\
\mathrm{F1\text{-}Score} &= \frac{2(\mathrm{Rec} \ast \mathrm{Prec})}{\mathrm{Rec} + \mathrm{Prec}} \\
\mathrm{Accuracy} &= \frac{TP + TN}{TP + FP + FN + TN}
\end{aligned}
\tag{2}
\]
where TP: true positive, FP: false positive, TN: true negative, and FN: false negative.
Figure 4. The proposed framework architecture shows two inputs: (a) contextual information network and (b) semantic information network, concatenated to generate multi-class output: subjective, objective, assessment, and plan.

Experimental results and analysis
The proposed methodology outlined earlier provides a theoretical foundation for clinical information identification and classification from unstructured clinical documents.
To construct a robust implementation of this study, it is crucial to determine the specific models and algorithms that can optimize each component individually, thereby producing high-performance intermediate results. These results can then be combined to achieve an overall optimal outcome for clinical information classification.
We conducted numerous experiments to assess the effects of a rule-based approach for initial training data preparation with SOAP labels as seed data for the AL model, and to evaluate a transformer-based AL model with uncertainty-based sampling (least confidence query strategy) for annotating unlabeled clinical data with SOAP labels. Finally, we evaluated a dual attention network model incorporating SciBERT-based TL input for capturing contextual information and UMLS-based semantic enrichment (UMLS-SE) input to help capture semantic information. In this section, we have illustrated the experimental results and presented an analysis.

Active learning (AL) performance evaluation
Optimizing active learning through strategic query selection. In the domain of AL, the efficiency of model training is often improved through the careful selection of data points from which the model can learn most effectively. In our recent study, we applied various AL query strategies to an annotation task on a dataset, initially comprising 243 records. These records were preannotated with a baseline
rule-based algorithm (“SOAPNotesParser”), from which we utilized the full set as the seed data to initialize our AL model. To refine the model’s learning process and to maintain a manageable workload for the human annotators involved in verification, we adopted an iterative approach, selecting 290 records per iteration for model training and evaluation. For each iteration, the dataset was automatically divided into an 80/20 ratio for training and validation.
Our methodology involved a comparative analysis of four distinct query strategies within the pool-based sampling paradigm: least confidence, prediction entropy, random sampling, and breaking ties. We monitored the accuracy rates obtained by the model under each strategy across 10 iterations, aiming to ascertain the efficacy of these strategies in enhancing the model’s performance.
The least confidence strategy concentrates on data points where the model has the lowest level of confidence in its predictions, usually determined by the predicted class having a probability close to 0.5. By assimilating knowledge from these ambiguous instances, the model enhances its ability to manage uncertainties. By the 10th iteration, this method attained the greatest accuracy of 94% in our testing, with only minor improvements over the final iterations, suggesting convergence.
The prediction entropy technique chooses data points that exhibit the greatest uncertainty among all classes, employing entropy as a metric to quantify the level of unpredictability in prediction. Although rather less precise than least confidence, it outperformed random sampling by introducing diversity in the training data, therefore enabling the model to differentiate between comparable classes.
Conversely, random sampling functions as a basic reference point where data points are selected at random, without considering the uncertainty of the model. This approach yielded somewhat slower enhancements in accuracy, therefore validating the superiority of more deliberate selection techniques.
Lastly, the breaking ties strategy focuses on situations when the model encounters difficulty in selecting between two probable results. Through its emphasis on these ambiguous situations, it enhances the process of making decisions at the periphery. Although superior to random sampling, it did not achieve the same level of performance as the least confidence strategy.
The experimental results, depicted in Figure 5(a), illustrate the trajectory of the model’s training accuracy. The least confidence strategy demonstrated superior performance, resulting in an optimal training accuracy of 94% by the 10th iteration. Notably, the accuracy plateaued between the ninth and 10th iterations, which indicated a point of convergence and served as our cue to cease further AL sample selection.
The robustness of the least confidence strategy was further validated during the testing phase, as portrayed in Figure 5(b). Here, the strategy outshone its counterparts, suggesting its greater reliability in generalizing from the AL model to unseen data. After the iterative training and testing phases, we applied the AL model to annotate the remaining records. The enriched dataset, thus augmented to encompass a total of 3146 records, will serve as a foundation for future research and applications within our AL framework.
This study affirms the value of employing judicious query strategies in AL to optimize the annotation process. The least confidence strategy has demonstrated its potential to expedite the attainment of high accuracy in model training, thereby streamlining the path toward developing more capable and efficient ML models. This approach aims to refine model predictions by focusing on cases where the model is least certain. The criteria for determining low prediction probabilities would involve threshold-based selection, where instances below a certain confidence level are reviewed. Implementing an oracle is expected to significantly improve model accuracy and reliability by ensuring only high-confidence predictions are used or by correcting mispredictions during training.

SOAP-BioMedBERT model performance evaluation
In this section, we take a closer look at how well our proposed SOAP-BioMedBERT dual attention network model performs in classifying clinical text. Our goal is to see how effectively the model captures both the context and the deeper meaning of the text by combining SciBERT-based TL with UMLS-based semantic enrichment. We examined the model’s performance using key metrics such as accuracy, F1 score, precision, and recall, evaluated over several iterations of AL. We will also compare our model’s performance to baseline models to highlight the improvements our approach offers. Through this analysis, we aim to show how our dual attention network can accurately sort clinical notes into the SOAP framework and demonstrate the real-world advantages of using dual attention mechanisms for clinical text classification.

Stratified k-fold and training. The dataset was subjected to a stratified k-fold (k = 5) cross-validation procedure to train the model and gather evaluation metrics. In this approach, the dataset is partitioned into k equally sized subsets, with each subset maintaining the same proportion of class labels as the original dataset. This method ensures that every class is appropriately represented in each fold, as demonstrated in Table 2, which outlines the distribution of classes across the folds.
During the cross-validation process, the model undergoes k iterations of training and evaluation. In each iteration, one of the k subsets is designated as the test set, and the remaining k–1 subsets serve as the training set. This strategy ensures comprehensive use of the data, with each subset getting an opportunity to be the test set exactly once. Consequently, every sample in the dataset is utilized for both training and testing purposes, promoting a thorough and balanced evaluation.
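As an illustration of the stratified split described above, the following minimal sketch deals the indices of each class round-robin into k folds, so that every fold keeps roughly the original label proportions and each fold serves as the test set exactly once. It is not the authors' implementation, which the paper does not list; the function name `stratified_k_fold`, the round-robin assignment, and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k stratified folds.

    Hypothetical helper: the indices of each class are shuffled and dealt
    round-robin into k folds, so every fold keeps roughly the same class
    proportions as the full label list.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    # Each fold is the test set once; the other k-1 folds form the training set.
    for test_fold in range(k):
        test = sorted(folds[test_fold])
        train = sorted(
            i for f, fold in enumerate(folds) if f != test_fold for i in fold
        )
        yield train, test


# Toy SOAP labels, made up for illustration (not the study's data).
toy_labels = (
    ["subjective"] * 10 + ["objective"] * 10 + ["assessment"] * 5 + ["plan"] * 5
)
splits = list(stratified_k_fold(toy_labels, k=5))
```

In practice, a library routine such as scikit-learn's `StratifiedKFold` serves the same purpose; the hand-written version is shown only to make the per-class partitioning explicit.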
Trainable parameter optimization. To determine the most effective configuration for the model’s parameters, our approach involved a systematic exploration of error rates through trial-based methods, aiming for superior accuracy in classification tasks. This involved an exhaustive search for the ideal learning rate while keeping other hyperparameters constant, to pinpoint the learning rate that minimizes loss and thus enhances the model’s reliability.
In our quest to identify the most suitable optimizer for our study, we compared the performance of Adam, RMSprop (RMSP), and stochastic gradient descent (SGD) using the trial-and-error approach. The outcome of this comparison favored the Adam optimizer, which demonstrated superior prediction accuracy.
Figure 6 illustrates the process of selecting the learning rate to optimize and assess the model. For these experiments, we utilized the scikit-learn library, adhering mainly to the default settings for hyperparameters. Table 3 presents a snapshot of the various hyperparameters applied to the models under study, showcasing the diversity in our experimental setup.
Each model underwent training using identical fold divisions, with the deep learning models implemented using the PyTorch framework on an NVIDIA GeForce RTX 3060 with 32 GB memory. The training process spanned 10 epochs, with hyperparameters configured to a maximum token size of 512, a batch size of 32, and a learning rate of 10e-3. Although additional epochs were explored in subsequent experiments, they did not yield significant performance improvements.

Table 2. Distribution of data across five folds in stratified k-fold cross-validation.
          Subjective   Objective   Assessment   Plan   Total
Fold 1    1037         944         535          630    3146
Fold 2    1020         951         538          636    3145
Fold 3    1042         923         523          657    3145
Fold 4    1061         963         504          617    3145
Fold 5    1030         919         556          641    3146

Table 3. Parameter settings of the proposed model.
Hyperparameters        Value
Max sequence length    10 k
Batch_size             32
Epochs                 10

Comparative analysis between proposed (SOAP-BioMedBERT) and other BERT-based models. In our study, we initially performed experiments using traditional machine-learning models to establish a baseline for clinical text classification. Among these, the support vector machine (SVM) model was particularly noteworthy due to its versatility and effectiveness in handling high-dimensional data. Characterized by its use of a linear kernel and optimized through a meticulous process of hyperparameter tuning, the SVM model was deployed to classify clinical texts into the predefined categories of assessment, subjective, objective, and plan.

Table 4. Improvement in neural network accuracy after integrating BERT embeddings: A comparison of CNN, RNN, and Bi-LSTM model performances before and after BERT adoption, highlighting accuracy gains.
Model    Accuracy Before BERT    Accuracy After BERT    Improvement
Table 5. Comparative performance metrics of BERT-based models in SOAP category text classification: accuracy, precision, recall, and
F1-score.
DistilBERT
BioBERT
Bio-ClinicalBERT
PubMedBERT-base
SciBERT
Proposed (SOAP-BioMedBERT)
Figure 7. F1-score performance comparison of six BERT-based models across subjective, objective, assessment, plan, and total categories
in biomedical text classification.
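The per-class scores summarized in Table 5 and Figure 7 rest on recall, precision, F1-score, and accuracy. As a minimal sketch, the code below derives one-vs-rest TP, FP, and FN counts for each SOAP label and computes the scores from them. The function names, the toy label lists, and the macro-averaged F1 (an unweighted mean over the four classes) are assumptions for illustration only; the source does not state how its "total" category is aggregated.

```python
def per_class_scores(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} from one-vs-rest counts."""
    scores = {}
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        # Guard against empty denominators for classes never predicted or seen.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[cls] = (precision, recall, f1)
    return scores


def overall_accuracy(y_true, y_pred):
    """Multi-class accuracy: the fraction of sentences labeled correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


SOAP_CLASSES = ["subjective", "objective", "assessment", "plan"]

# Toy gold labels and predictions, made up for illustration.
y_true = ["subjective", "subjective", "objective", "objective",
          "assessment", "assessment", "plan", "plan"]
y_pred = ["subjective", "objective", "objective", "objective",
          "assessment", "plan", "plan", "plan"]

scores = per_class_scores(y_true, y_pred, SOAP_CLASSES)
# Assumed aggregation: unweighted (macro) mean of the per-class F1 values.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(SOAP_CLASSES)
accuracy = overall_accuracy(y_true, y_pred)
```

Reporting the scores per class, as the comparison here does, keeps weaker categories visible that a single aggregate number could hide.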
Figure 7 illustrates a comparison of F1-scores across various BERT-based models, including DistilBERT, BioBERT, Bio-ClinicalBERT, PubMedBERT-base, SciBERT, and SOAP-BioMedBERT, across five key categories: subjective, objective, assessment, plan, and total. The visualization highlights the performance disparities among these models in biomedical text classification, with SOAP-BioMedBERT emerging as the top performer across all categories. This comparison not only showcases the effectiveness of specialized models in capturing biomedical nuances but also aids in selecting the most suitable model based on a balance of computational
efficiency and accuracy for specific biomedical text analysis tasks.
The DistilBERT model, while effective, lagged behind more specialized models like BioBERT and PubMedBERT-base, reflecting the potential limitations of more generalized pretraining when applied to domain-specific tasks.
The results underscore the effectiveness of domain-specific pretraining, as evidenced by the superior performance of SOAP-BioMedBERT, which has been specifically tailored for biomedical text. This model’s enhanced ability to grasp the nuances of medical literature is attributed to its training on a comprehensive corpus of biomedical texts, allowing for improved context understanding and semantic interpretation.
The slight edge of SOAP-BioMedBERT over other models can be attributed to its optimized architecture and training regimen, which was meticulously designed to capture the intricate patterns and terminologies prevalent in biomedical documents. Its performance suggests that further advancements in model architecture and training methodologies could yield even more significant improvements in text-processing capabilities for biomedical applications.
The study also highlights the importance of selecting the appropriate model for specific tasks within the biomedical domain. While general-purpose models like DistilBERT offer broad applicability, specialized models like SOAP-BioMedBERT provide the precision and accuracy necessary for high-stakes environments like healthcare and medical research.

Analysis of the best model: SOAP-BioMedBERT. The BERT model’s better performance gave us the confidence to check with other embedding options. Finally, we incorporated the BERT-based embedding layer called “scibert-basevocab-uncased” together with the UMLS-based embedding layer, which produced the best results, with an accuracy of about 98%, better than all other configurations, and a minimum loss of about 1%. Our methodology incorporates a Bi-LSTM layer on top of the SciBERT model to better grasp the long-range dependencies and complex linguistic structures in clinical texts. This addition, while beneficial for model performance, introduces significant computational overhead and resource demands. The sequential nature of RNNs extends training times and requires considerable GPU resources, impacting both memory and processing power, particularly during backpropagation. To assess the trade-offs of this architectural choice, we conducted a performance-cost analysis, evaluating accuracy, precision, recall, and F1 score against training duration and GPU usage. This approach provides insights into the balance between enhanced model capabilities and the associated computational and resource implications.
The proposed model was tested at multiple points to determine the appropriate number of epochs, and we obtained the optimal results at epoch 10, as shown in Figure 8.
In the evaluation of the model’s performance over 10 epochs, two key indicators were observed: loss and accuracy, both for training and testing datasets. The upper graph showcases the training and testing loss over successive epochs. Initially, both the training and testing losses start relatively high, with the training loss demonstrating a sharp decline by the second epoch, indicating that the model is learning from the training data. The testing loss, while decreasing overall, shows fluctuations, suggesting variability in how the model generalizes to new, unseen data.
By the 10th epoch, the training loss has significantly decreased, suggesting that the model fits well with the training data. However, there is a noticeable gap between the training and testing loss, potentially indicating overfitting, as the model may not be generalizing as effectively to the testing data.
The lower graph illustrates the training and testing accuracy. Here, the training accuracy consistently improves over time, indicative of the model effectively learning and making better predictions on the training data. The testing accuracy, after initial fluctuations, shows an upward trend, but it does not reach the level of training accuracy by the final epoch. This again could signal overfitting, where the model’s improvements are more reflective of the training data patterns rather than a generalized learning applicable to the test data.
Overall, while the model demonstrates an aptitude for learning and improving its performance on the training data, the discrepancy between training and testing metrics suggests that further tuning is required to improve generalization and prevent overfitting. Strategies such as regularization, dropout, or expanding the dataset may be considered to enhance the model’s performance on unseen data.

Discussion
When comparing our work to existing studies, such as those conducted by Mowery et al.37 and de Oliveira et al.,5 we observe noteworthy patterns in performance as captured by the F1-scores for various classes as shown in Table 6. Our methodology yields consistently higher scores across all classes, with the “subjective” class showing a notable increase from 0.939 and 0.9477 to 0.98591. This suggests that our approach may more effectively capture the nuances of subjective information within the data.
Similarly, in the “objective” class, our F1-score of 0.98690 surpasses the previous high of 0.9566 by de Oliveira et al.,5 indicating a stronger ability to identify and classify objective statements correctly. This improvement is critical, as objective data is often essential for drawing concrete conclusions from research findings.
The “assessment” and “plan” classes also show significant improvements in our work. The “assessment” class, which has traditionally presented challenges as indicated
by the relatively lower scores of 0.757 and 0.7323, sees a dramatic increase to 0.98344 in our study. This substantial enhancement suggests that our model may possess a heightened sensitivity to the key features that distinguish assessment-related content, which is crucial for medical diagnosis and treatment planning.

Table 6. Comparative analysis of F1-scores across different studies for classifying clinical notes.
Class         Mowery et al.5   de Oliveira et al.37   Our work (SOAP-BioMedBERT)
              F1-Score         F1-Score               F1-Score
Subjective    0.939            0.9477                 0.98591
Objective     0.945            0.9566                 0.98690
Assessment    0.757            0.7323                 0.98344
Plan          0.770            0.9435                 0.98360

In the “plan” class, we see an improvement from 0.770 and 0.9435 to 0.98360, indicating our model’s strength in effectively recognizing planning actions, which are imperative for the implementation of medical care.
It is important to note that while these improvements are promising, they are not solely indicative of the superiority of our model. Various factors, such as dataset composition, labeling consistency, and model architecture, can influence these results. Moreover, our model’s increased performance in the “assessment” and “plan” classes, which are particularly challenging due to their predictive and prescriptive nature, may suggest a potential for our model to better understand and process complex sentence structures and semantics associated with medical decision-making.
The results underscore the Bi-LSTM layer’s contribution to enhancing the model’s performance on clinical text classification tasks. The incremental gains in accuracy and F1 score justify the additional computational resources, especially for
• “Patient complains of persistent headaches and blurred vision over the past few days.”
• “Physical examination shows no neurological deficits, but blood pressure is significantly elevated.”
• “The assessment is that the patient may be experiencing hypertensive crisis.”
• “Plan: recommend lifestyle changes, initiate antihypertensive therapy, and schedule follow-up.”

Active learning strategy. The AL model, trained on the initial seed data, uses the least confidence strategy to select sentences with low prediction confidence for manual annotation. In this note, the model shows low confidence in classifying the sentence:
“Patient complains of persistent headaches and blurred vision over the past few days.”
This sentence is first labeled by the proposed AL model, then reviewed by a human expert to ensure accurate classification. It is then added back into the training dataset. This process continues iteratively, refining the model’s ability to classify clinical notes accurately.

Annotation by model and verified by oracle.
• Subjective: “Patient complains of persistent headaches and blurred vision over the past few days.”
Explanation: This sentence is identified as patient-reported symptoms, fitting the subjective category.
• Objective: “Physical examination shows no neurological deficits, but blood pressure is significantly elevated.”
Explanation: This sentence describes observations made during the physical examination, fitting into the objective category.
• Assessment: “The assessment is that the patient may be experiencing a hypertensive crisis.”
Explanation: This sentence provides the clinician’s diagnostic impression, aligning with the assessment category.
• Plan: “Plan: recommend lifestyle changes, initiate antihypertensive therapy, and schedule follow-up.”
Explanation: This sentence outlines treatment and follow-up actions, categorizing it under the plan section.

These additional labeled instances enhance the training dataset, improving the model’s accuracy and reliability.

Clinical note 3: Final testing of the model
Clinical note: The patient describes feeling extremely fatigued and having had a persistent dry cough for the last two weeks. Physical examination indicates decreased breath sounds in the right lung. The assessment is that the patient may have pneumonia. Plan: Start antibiotics and order a chest X-ray.

Preprocessing. The clinical note is segmented into the following sentences:
• “Patient describes feeling extremely fatigued and having a persistent dry cough for the last two weeks.”
• “Physical examination indicates decreased breath sounds in the right lung.”
• “The assessment is that the patient may have pneumonia.”
• “Plan: Start antibiotics and order a chest X-ray.”

Final model classification. Using the trained deep learning model, which has been fine-tuned through the AL process, each sentence is classified into the appropriate SOAP category:
• Subjective: “Patient describes feeling extremely fatigued and having a persistent dry cough for the last two weeks.”
• Objective: “Physical examination indicates decreased breath sounds in the right lung.”
• Assessment: “The assessment is that the patient may have pneumonia.”
• Plan: “Plan: Start antibiotics and order a chest X-ray.”

Conclusions, limitations, and future works
The vast availability of unstructured clinical data offers an opportunity to extract meaningful information for the applications that support the process of clinical decision-making. However, extracting the relevant information from unstructured text into a clinically useful format is a big challenge. Therefore, this work targeted this aspect of information extraction into a well-known protocol (SOAP) used as an information container. The clinical text in the form of SOAP structure enhances information readability, and the individual sentences, that is, subjective, objective, assessment, and plan, can be used in other add-on applications such as clinical decision support systems. Additionally, it helps organizations develop multiple individualistic systems such as diagnostic, treatment, and prognostic by utilizing the relevant SOAP section.
Despite the promising results, our study has a few limitations. Firstly, the performance of the proposed model heavily relies on the availability of high-quality labeled data. The quality and accuracy of the annotations can significantly impact the classification performance. Additionally, the generalizability of our model may be limited to the specific context of the i2b2 dataset and may require adaptation when applied to other datasets or clinical settings. Furthermore, the proposed methodology assumes the SOAP framework, and its effectiveness may vary
20 DIGITAL HEALTH
when applied to different medical protocols or classification tasks.

In future research, an interesting direction to explore is the incorporation of prompt engineering techniques based on large language models (LLMs). LLM-based prompt engineering has shown promising results in improving the performance of language models on various NLP tasks. Integrating LLM techniques into the proposed methodology for clinical text annotation and classification could yield further improvements.

One potential approach is to leverage LLMs to generate informative prompts for AL. These generated prompts can help direct the annotation effort toward more informative samples, enhancing the efficiency and effectiveness of the AL model. Furthermore, LLMs can be used to refine and adapt the transformer-based model for the specific domain of clinical notes and SOAP classification. Prompt-based fine-tuning techniques could be explored to optimize the models’ performance on the SOAP classification task.

Additionally, exploring the combination of AL and LLM-based prompt engineering can lead to enhanced annotation quality and model performance. By leveraging the contextual knowledge and capabilities of LLMs, models can better understand the clinical context and improve the accuracy of the generated annotations during the AL iterations.

Overall, incorporating LLM-based prompt engineering techniques into the proposed methodology has the potential to further advance the field of clinical text annotation and classification. It can enhance the efficiency, accuracy, and generalizability of models, making them more robust in handling variations in clinical notes and improving their performance on the SOAP framework.

Contributorship: MA, JH, and AB contributed to conceptualization. MA, JH, and AB contributed to data curation. MA, JH, and AB contributed to investigation. JH, MA, and AB contributed to methodology. JH, MA, and AB contributed to writing original draft. All authors have read, reviewed, and approved the final manuscript.

Declaration of conflicting interests: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval: This study did not require ethics committee review and approval.

Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Grand Information Technology Research Center support program (IITP-2024-RS-2020-II201489) and was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2023-00259004) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation) and by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2022-II220078, Explainable Logical Reasoning for Medical Knowledge Generation), (RS-2017-II170655, Lean UX core technology and platform for any digital artifacts UX evaluation).

Guarantor: Muhammad Afzal.

Patient consent statement: This research utilized a publicly available dataset that did not contain any directly identifiable patient information. Therefore, informed patient consent was not required for this study.

ORCID iDs: Jamil Hussain https://orcid.org/0000-0003-3862-8787
Asim Abbas https://orcid.org/0000-0001-6374-0397