0% found this document useful (0 votes)
69 views

Large Language Models For Data Annotation - A Survey

Uploaded by

romanjaimesc
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
69 views

Large Language Models For Data Annotation - A Survey

Uploaded by

romanjaimesc
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

Large Language Models for Data Annotation: A Survey

Zhen Tan♠∗ Alimohammad Beigi♠∗ Song Wang♣ Ruocheng Guo♦ Amrita Bhattacharjee♠
Bohan Jiang♠ Mansooreh Karami♠ Jundong Li♣ Lu Cheng ♥ Huan Liu♠

School of Computing, and Augmented Intelligence, Arizona State University

Department of Electrical and Computer Engineering, the University of Virginia

ByteDance Research ♥ Department of Computer Science, University of Illinois Chicago
{ztan36,abeigi,abhatt43,bjiang14,mkarami,huanliu}@asu.edu
{sw3wv,jundong}@@virginia.edu
[email protected], [email protected]
Abstract preference labels to tailor outputs to specific crite-
ria or user needs, ❺ annotating entity relationships
Data annotation is the labeling or tagging of
raw data with relevant information, essential to understand how entities within a dataset interact
for improving the efficacy of machine learn- with each other (Wadhwa et al., 2023), ❻ marking
arXiv:2402.13446v1 [cs.CL] 21 Feb 2024

ing models. The process, however, is labor- semantic roles to define the underlying roles that
intensive and expensive. The emergence of entities play in a sentence (Larionov et al., 2019),
advanced Large Language Models (LLMs), ex- and ❼ tagging temporal sequences to capture the
emplified by GPT-4, presents an unprecedented order of events or actions (Yu et al., 2023).
opportunity to revolutionize and automate the
Data annotation poses significant challenges for
intricate process of data annotation. While ex-
isting surveys have extensively covered LLM current machine learning models due to the com-
architecture, training, and general applications, plexity, subjectivity, and diversity of data, requir-
this paper uniquely focuses on their specific ing domain expertise and the resource-intensive
utility for data annotation. This survey con- nature of manually labeling large datasets. Ad-
tributes to three core aspects: LLM-Based Data vanced LLMs such as GPT-4 (OpenAI, 2023),
Annotation, Assessing LLM-generated Anno- Gemini (Team et al., 2023) and Llama-2 (Touvron
tations, and Learning with LLM-generated an-
et al., 2023b) offer a promising opportunity to revo-
notations. Furthermore, the paper includes an
in-depth taxonomy of methodologies employ-
lutionize data annotation. LLMs serve as more than
ing LLMs for data annotation, a comprehen- just tools but play a crucial role in improving the ef-
sive review of learning strategies for models fectiveness and precision of data annotation. Their
incorporating LLM-generated annotations, and ability to automate annotation tasks (Zhang et al.,
a detailed discussion on primary challenges and 2022), ensure consistency across large volumes of
limitations associated with using LLMs for data data (Hou et al., 2023), and adapt through fine-
annotation. As a key guide, this survey aims to tuning or prompting for specific domains (Song
direct researchers and practitioners in explor-
et al., 2023), significantly reduces the challenges
ing the potential of the latest LLMs for data
annotation, fostering future advancements in encountered with traditional annotation methods,
this critical domain. We provide a compre- setting a new standard for what is achievable in
hensive papers list at https://github.com/ the realm of NLP. This survey delves into the nu-
Zhen-Tan-dmml/LLM4Annotation.git. ances of using LLMs for data annotation, explor-
ing methodologies, learning strategies, and asso-
1 Introduction ciated challenges in this transformative approach.
In the complex realm of machine learning and NLP, Through this exploration, our goal is to shed light
data annotation stands out as a critical yet chal- on the motivations behind embracing LLMs as cata-
lenging step, transcending simple label attachment lysts for redefining the landscape of data annotation
to encompass a rich array of auxiliary predictive in machine learning and NLP.
information. This detailed process typically in- We navigate the terrain of leveraging the latest
volves ❶ categorizing raw data with class or task breed of LLMs for data annotation. The survey
labels for basic classification, ❷ adding intermedi- makes four main contributions:
ate labels for contextual depth (Yu et al., 2022), ❸ • LLM-Based Data Annotation: We dive into the
assigning confidence scores to gauge annotation re- specific attributes (e.g., language comprehen-
liability (Lin et al., 2022), ❹ applying alignment or sion, contextual understanding), capabilities (e.g.,

Equal contribution. text generation, contextual reasoning), and fine-
tuning or prompting strategies (e.g., prompt en- Preliminaries LLM-Based Data Annotation
gineering, domain-specific fine-tuning) of newer Scenarios Manually Engineered Prompts

LLMs like GPT-4 and Llama-2 that make them Fully Supervised
Zero-Shot Few-Shot
Learning
uniquely suited for annotation tasks. Semi-Supervised
• Assessing LLM-Generated Annotations: We ex- Learning Alignment via Pairwise Feedback

plore various methods for assessing annotation Unsupervised Human Automated


Learning Feedback Feedback
quality and how to choose high-quality annota-
tions from numerous options. Techniques
• Learning with LLM-Generated Annotations: We Input-Output Assessing LLM-Generated
Prompting (IOP)
investigate the methodologies to train machine Annotations
In-Context
learning models based on annotations generated Learning (ICL) Evaluation
by LLMs, assessing the quality, reliability, and Chain-of-Thought Human Automated
Prompting (CoT)
impact on downstream tasks. Centric (Task-speific)

• Challenges and Ethical Considerations: We iden- Instruction


Tuning (IT)
tify and discuss challenges ranging from techni- Data Selection
Alignment Via Active Learning
cal limitations such as sampling bias and hallu- Tuning (AT)
cination to ethical dilemmas like social bias and
the broader societal implications. Learning with LLM-Generated Annotations
Focusing on this underrepresented aspect of LLM
Target Domain Knowledge In-Context
application, the survey aims to serve as a valuable Inference Distillation Learning (ICL)

guide for academics and practitioners, who intend Chain-of-Thought


Predicting Labels Prompting (CoT)
to deploy LLMs for Annotation. Note that in this Fine-Tuning
Inferring Additional and Instruction Tuning (IT)
survey, we primarily focus on pure language mod- Attributes Prompting
Alignment Tuning (AT)
els. We thus have not considered recently emerg-
ing multimodal LLMs, such as LLaVA (Liu et al.,
Figure 1: The structure of this survey.
2023b). Figure 1 illustrates the general structure
of this survey. A list of potential tools for utilizing
els: an annotator model, denoted as A, which maps
LLMs for annotation is included in Appendix A
input data to annotations, and a task learner, repre-
with explanatory examples.
sented as L, that learns from these annotated data
Differences from Other LLM-related Surveys
to accomplish specific tasks. Our primary focus is
While existing surveys on LLMs extensively cover
on utilizing advanced LLMs like GPT-4 (OpenAI,
architectural nuances (Zhao et al., 2023), training
2023) and LLaMA (Touvron et al., 2023a) as anno-
methodologies (Liu et al., 2023d), knowlegde edit-
tators (A), while the task learner (L) may involve
ting (Wang et al., 2023c), and evaluation proto-
a less complex model such as BERT (Devlin et al.,
cols (Chang et al., 2023) associated with LLMs,
2018), which learns from these annotated data to
their main focus lies on the capabilities of mod-
perform designated tasks. LLM-generated annota-
els for specific end tasks such as machine trans-
tions encompass categorical labels and enhance raw
lation (Min et al., 2021), alignment (Wang et al.,
data points with a comprehensive array of auxiliary
2023d), code generation (Zan et al., 2023), and
signals. These annotations, including confidence
medicine (Thirunavukarasu et al., 2023). In con-
scores, contextual details, and other metadata, ex-
trast, this survey distinguishes itself by placing an
tend beyond traditional categorical labels.
emphasis on the application of these potent next-
generation LLMs to the intricate realm of data an- 2.2 Scenarios
notation, a domain that is crucial yet underexplored.
Given the diverse range of NLP tasks, we primarily
2 Notations and Preliminaries focus on classification tasks in this survey. How-
In this section, we introduce significant notations ever, our approach can be extended to other do-
utilized in this paper and preliminaries. The nota- mains, such as text generation, where an explicit
tions and their definitions can be found in Table 1. label y might not be applicable. To illustrate our
approach, let Du = xi i = 1N denote an unlabeled
2.1 Problem framework data pool and Dl = (xj , yj )j = 1M a manually la-
In this section, we delve into our approach to the beled dataset, where N and M represent their sizes,
annotation process. We introduce two core mod- which can vary across scenarios. In classification
Table 1: Notations and the corresponding descriptions. be manually or algorithmically generated using a
Notations Definitions or Descriptions function H, expressed as p = H(D, x).
⊕ Concatenation operator. Input-Output Prompting (IO) (Kojima et al.,
x A data point.
y A ground truth label. 2022) serves as the fundamental interaction mode
ŷ A predicted label. with an LLM, denoted by the function F. A prompt
A An annotator model used for annotation. p is provided to obtain an output o = A(p).
L A task learner that learns a specific task.
p A prompt. In-Context Learning (ICL) builds upon IO by
o An output of an LLM. enriching the prompt with a sequence of demon-
r A reasoning pathway.
I An instruction(s) generated by humans.
strations, or example pairs, E = {(xe , oe )}E e=1 ,
q A description of a specific task. thus guiding the LLM toward a desired output
z A human preference score. o = A(E ⊕ p).
D A dataset.
Du An unlabeled dataset. Chain-of-Thought Prompting (CoT) further en-
Dl A manually labeled dataset. hances ICL by appending a reasoning pathway
Dgu The Du augmented by LLM annotations. re to each demonstration in E, resulting in E =
Dgl The Dl augmented by LLM annotations.
N The size of an unlabeled dataset. {(xe , re , oe )}E
e=1 . This augmentation can improve
M The size of a manually labeled dataset. the LLM’s inference capabilities.
E A sequence of demonstrations. Note that ⊕ denotes concatenation, implying
α(xi , L) An acquisition function.
H(D, x) A prompt generation function. that in both ICL and CoT, the example pairs E are
integrated into the prompt p to form an extended
tasks, we explore the following settings: prompt. Additionally, it’s noteworthy that ICL can
1. Fully Supervised Learning: M > 0, N = 0. be regarded as a specialized form of IO, and CoT
The annotator A generates auxiliary signals as a specialized form of ICL.
for data points in Dl and transforms it into Instruction Tuning (IT) is introduced to fine-tune
Dgl . Formally, Dgl = {xj , yj , oj }M
j=1 , where LLMs based on task-specific instructions, enabling
oj = A(xj ). The learner L is then trained on them to generalize across various downstream tasks.
Dgl . For example, in a sentiment analysis task, The process can be formulated as o = A(q ⊕ p),
the attribute oj generated by A could highlight where q represents the task description.
key phrases and sentiment intensity in movie Alignment Tuning (AT) aims to fine-tune LLMs
reviews, helping the task learner L classify re- to align their behaviors with human preferences. In
views accurately as positive or negative. addition to human-labeled data, researchers utilize
2. Unsupervised Learning: M = 0, N > 0. In LLM-generated annotations for fine-tuning. Gen-
this case, A operates on Du to produce Dgu de- erally, the LLM-based annotation process can be
fined as Dgu = {xi , oi }N
i=1 , where oi = A(xi ).
represented as z = A(q ⊕ x1 ⊕ x2 ⊕ p), where x1
The task learner L is trained on this dataset. and x2 denote two candidate responses generated
by LLMs, and q represents the task description. z
3. Semi-Supervised Learning: M > 0, N > 0,
represents a score indicating human preference and
and usually N ≫ M . Here, the annotator A can
is typically modeled as a value between 0 and 1.
operate on either or both Dl and Du to produce
This rating zj is generated according to a specific
a combined dataset Dg . The task learner L is
reward R and indicates a human-based compari-
then trained on Dg .
son for the better candidate response xz , where
These scenarios share two common elements: (1) R(q, xz ) > R(q, x1−z ) (Dubois et al., 2023).
Annotation processes by the LLM annotator A and
(2) Learning strategies for L based on A’s annota- 3 LLM-Based Data Annotation
tions. Subsequent sections detail a novel taxonomy The emergence of Large Language Models has
that organizes methods according to these aspects. sparked significant interest in their capacity for
A collection of taxonomized papers are presented high-quality, context-sensitive data annotation.
in Appendix B. This section explores the diverse techniques and
methodologies used for data annotation via LLMs.
2.3 Prompt & Tuning Techniques for LLMs
This subsection formalizes techniques commonly 3.1 Manually Engineered Prompts
utilized in interactions with LLMs. Given an input Manually engineered prompts are essential for
x and a task-specific dataset D, a prompt p can LLMs in annotation tasks, designed to elicit spe-
cific annotations. They are categorized as either on specific LLM responses (Ziegler et al., 2019).
zero-shot which lack demonstrations or few-shot Despite its effectiveness, this approach is expensive
which include them (Dong et al., 2023). and demands considerable effort (Bakker et al.,
Zero-shot. In the early stages of LLM research, 2022). Initiatives like Sparrow (Glaese et al.,
zero-shot prompts gained traction due to their sim- 2022) set standards for human annotators, yet
plicity and effectiveness. Formally, annotations are discrepancies between researcher intentions and
derived by mapping a carefully designed prompt annotator perceptions may affect feedback quality.
q to an annotation o = A(q). The prompt may
include an instruction I outlining the task along Automated Feedback. Consequently, recent ad-
with a ground truth label y. For instance, the study vancements aim to automate the feedback mech-
by ZEROGEN (Ye et al., 2022) shows the utility of anism, frequently utilizing another LLM or the
zero-shot prompts, using phrases like “The movie same LLM to annotate distinct outputs (Bakker
review with positive sentiment is:” to guide the et al., 2022; Wang et al., 2023b). This method-
LLM in generating text x aligned with the label y. ology typically involves an LLM functioning as
Few-shot. This category involves employing In- a reward model, informed by human preference
Context Learning (ICL) to generate annotations. data (Menick et al., 2022). For example, OpenAI
ICL can be viewed as an advanced form of prompt and DeepMind have implemented the 6B GPT-3
engineering that combines human-generated in- and 7B Gopher models, respectively, as reward
structions I with demonstrations sampled from Dl . models. Various studies have delved into diverse
In few-shot scenarios, the selection of demonstra- facets of this automated method. For instance, re-
tion samples is crucial (Liu et al., 2023c). For search by Stiennon et al. (2020) collected human
instance, in few-shot semantic parsing, GPT-3 is comparative judgments of summaries to train a re-
utilized by Shin et al. (2021) to select random ward model. This model was then leveraged to re-
samples from the training set as demonstrations. fine a summarization policy through reinforcement
Another approach by Rubin et al. (2022) uses a learning. Furthermore, Askell et al. (2021) evalu-
scoring LLM A to evaluate the potential useful- ated different training goals for the reward model,
ness of demonstration samples. Here, given a discovering that ranked preference modeling tends
target instance (xi , yi ), the model evaluates the to improve with model size more effectively than
score of a candidate sample (xj , yj ) ∼ Dl as imitation learning. This model utilizes assorted so-
P robA (yi |xi , (xj , yj )). These scores are used to cial welfare functions to amalgamate these personal
train an unsupervised demonstration retriever, ini- preferences. The most current research (Rafailov
tializing from BERT-base via contrastive learning. et al., 2023) employed the Bradley-Terry model for
Furthermore, there are efforts that integrate other instructing LLMs to assess choices made by human
types of annotations into ICL. For example, Su- annotators.
perICL (Xu et al., 2023) incorporates confidence 4 Assessing LLM-Generated Annotations
scores from a smaller language model into demon-
Effective evaluation of annotations generated by
strations, further enhancing the annotation process.
LLMs is crucial to fully harness their potential.
3.2 Alignment via Pairwise Feedback This section focuses on two main aspects:
The importance of aligning LLMs with human- 4.1 Evaluating LLM-Generated Annotations
centric attributes has become increasingly recog- This subsection explores various methods for as-
nized. These attributes, which include Helpfulness, sessing annotation quality, ranging from human-led
Honesty, and Harmlessness, are essential for LLMs to automated approaches.
intended for public interaction, beyond their in- General Approaches: Research has investigated
herent NLP skills (Zhao et al., 2023). Traditional diverse methods for evaluating LLM annotations.
unsupervised learning methods, such as next-word The “Turking Test” by Efrat and Levy (2020), eval-
prediction, fail in instilling these qualities. uates LLMs’ adherence to data annotation guide-
Human Feedback. The dominant strategy for em- lines, with human annotators comparing LLM
bedding these characteristics into LLMs involves outputs against benchmarks like SNLI (Bowman
fine-tuning based on human preferences (Dai et al., 2015), SQuAD (Rajpurkar et al., 2016), and
et al., 2023). A prevalent yet resource-intensive NewsQA (Trischler et al., 2016). Similarly, Hon-
technique requires gathering quantitative feedback ovich et al. (2022a) manually examined the orig-
inality, accuracy, and variety of datasets created 5.1 Target Domain Inference: Direct
by LLMs, focusing on their response to instruc- Utilization of Annotations
tions. Additionally, studies such as by Alizadeh In this section, we explore the practical application
et al. (2023) measure the performance of open- of LLM-generated annotations in diverse down-
source LLMs against human-annotated labels in stream tasks. Annotations, extracted from LLMs
tasks like relevance and topic detection. through carefully designed prompts, provide valu-
Task-Specific Evaluations: Methodologies vary able predictions for a wide range of downstream
by application. For instance, in knowledge graph applications. Such usage can be categorized accord-
enhancement, token ranking metrics assess LLM ing to the definitions in Section 2: a. Supervised:
contributions in fact completion. Additionally, eval- Labels are utilized in any form. b. Unsupervised:
uations of counterfactual generation often utilize di- Annotations function as predictions with no labels
versity metrics like Self-BLEU (Chen et al., 2023), involved, e.g., zero-shot scenarios.
while code generation relies on metrics such as Predicting Labels. Utilizing manually designed
Pass@k (Nijkamp et al., 2022). In scenarios re- prompts, LLMs generate predicted labels in two
quiring extensive datasets, the quality of LLM- distinct manners. First, they predict labels while
generated annotations is compared to gold standard considering demonstration samples, denoted as
labels within a small, labeled subset (Zhao et al., ŷ = A(q(x|D)). Second, they make predictions
2021; Agrawal et al., 2022; He et al., 2023). without reliance on demonstration samples, repre-
4.2 Data Selection via Active Learning sented as ŷ = A(q(x)). Depending on the source
of these demonstration samples, which could be
Choosing high-quality annotations from numerous
either D ⊂ Dl or D ⊂ Du , this can be classified as
options is crucial. Active Learning (AL) emerges
either supervised or unsupervised (Sorensen et al.,
as a key technique, especially when integrating
2022). This technique has enabled LLMs to con-
LLMs into the AL process. This section introduces
tribute to a wide array of tasks, spanning across ar-
pool-based AL within the Learning for Annotation
eas such as reasoning, knowledge bases, causal rea-
framework, where a vast pool of unlabeled data and
soning, recommendation systems, healthcare, and
a smaller set of labeled data exist. AL strategically
even vision-language models (Wei et al., 2022a;
selects the most informative samples from the pool
Kojima et al., 2022; Petroni et al., 2019; Kıcıman
to enhance the learning model’s performance or
et al., 2023; Hou et al., 2023; Gu et al., 2023a).
until reaching a budget limit.
Inferring Additional Attributes. Similarly,
LLMs as Acquisition Functions: Various types
LLMs adeptly correlate prompts with specific at-
of acquisition functions α(xi , L) exist, categorized
tributes or concepts, effectively in both supervised
as (a) Diversity, (b) Uncertainty, and (c) Similarity.
and unsupervised settings (Sorensen et al., 2022).
Notable research in this context includes studies
This capacity proves particularly advantageous in
by Shelmanov et al. (2021); Tamkin et al. (2022);
the case of models such as Concept Bottleneck
Margatina et al. (2023), each investigating different
Models (Tan et al., 2023c,b), which generate predic-
aspects of using LLMs as acquisition functions.
tions by identifying the underlying concepts. In this
LLMs as Oracle Annotators: Innovative stud-
context, LLMs effectively tackle the issue of lim-
ies (Bansal and Sharma, 2023; Wu et al., 2023a)
ited dataset annotations. In vision-language tasks,
have employed LLMs as oracle annotators in AL
LLMs can be employed to automatically generate
setups, enhancing domain generalization and in-
textual descriptions for image classifications (Rad-
context learning for NLP models. Additionally,
ford et al., 2021; Menon and Vondrick, 2022).
Kim et al. (2023) proposed utilizing LLMs to an-
notate task-specific preferences between input text 5.2 Knowledge Distillation: Bridging LLM
pairs, facilitating joint learning with task labels. and task-specific models
Expanding on the previous discussion regarding the
5 Learning with LLM-Generated direct use of annotations, Knowledge Distillation
Annotations (KD) emerges as an additional approach to harness
The LLM-generated annotations provide a valuable the capabilities of LLMs. KD facilitates the transfer
resource of labeled data for diverse machine learn- of expertise from a larger “teacher” model, typi-
ing tasks. This section explores the methodologies cally an LLM, to a smaller, more focused “student”
in learning with LLM-Generated Annotations. model. This technique enables the student model
to match or even surpass the teacher’s performance, prompts assist LLMs in extrapolating to new, un-
despite lower resource demands. seen tasks without requiring explicit parameter up-
Model Enhancement. Currently, several stud- dates. Although effective, they are generally dif-
ies have embraced KD to enrich a task-specific ficult to achieve (Margatina et al., 2023). There-
learner model, denoted as L, with insights from fore, an effective approach to obtaining helpful
an LLM-based annotator, referred to as A. For prompts based on the annotations generated by
example, research endeavors like (Magister et al., LLMs (Hongjin et al., 2022). As the task instruc-
2022; Fu et al., 2023; Sun et al., 2023; Li et al., tions are crucial for the performance of ICL, multi-
2024) focus on training L using datasets annotated ple works are proposed to automatically generate
by A. Conversely, (Hsieh et al., 2023) employs instructions without the laborious process of human
“task hardship” as auxiliary labels supplied by A manipulations (Zhao et al., 2023). In (Honovich
to enhance the learning process for L. Notably, et al., 2022b), the authors observe that provided
Alpaca (Taori et al., 2023a) and GPT4All (Anand with several demonstration examples, LLMs can
et al., 2023) employ LLM-generated corpora to learn to generate the instructions for various tasks
train their lightweight student models to achieve and thus promote ICL performance. Apart from
impressive performance. methods that utilize LLM-generated annotations as
KD Innovations. In terms of tools, GKD (Tan instructions, other works also explore the possibil-
et al., 2023a) stands out as a recently developed ity of leveraging LLM-generated demonstrations
library that simplifies the KD process with LLMs. for ICL (Dong et al., 2022). Among them, a recent
Advancements in this dynamic field encompass work named synthetic prompting (Shao et al., 2023)
both black-box (Jiang et al., 2023b) and white- has gained traction. This technique constructs new
box (Gu et al., 2023c) LLMs serving as teacher questions based on a given input question’s rea-
models, improvements in efficiency (Jha et al., soning chain, followed by a clustering method to
2023), and expansions into specialized domains select the most diversified and complex demonstra-
such as biomedical knowledge extraction (Gu et al., tions. Utilizing raw text datasets as a warm up,
2023b), code generation (Gunasekar et al., 2023a), (Chen et al., 2022) introduce a method to create
web content filtering (Vörös et al., 2023), and math- self-supervised data that aligns with ICL learning
ematical reasoning (Fu et al., 2023). formats for various downstream tasks.
In summary, the adoption of KD for training task-
specific models offers the dual advantages of de- Chain-of-Thought Prompting. It represents a
creased computational demands and sustained per- specialized method within ICL that specifically
formance, positioning it as a highly promising av- enhances the performance of LLMs on intricate
enue in contemporary natural language processing. reasoning tasks like arithmetic reasoning (Miao
et al., 2021), common-sense reasoning (Talmor
5.3 Harnessing LLM Annotations for et al., 2018), and symbolic reasoning (Wei et al.,
Fine-Tuning and Prompting 2022b). Unlike traditional ICL, CoT introduces
The use of LLM-generated annotations for fine- intermediate reasoning steps in the prompts. These
tuning or prompting in LLM adaptation is increas- steps are designed to contribute meaningfully to-
ingly popular, following Knowledge Distillation ward the final output. This distinction underscores
principles to unlock LLMs’ potential. Studies show the focus of CoT on the mechanics of reasoning.
that larger datasets for supervised fine-tuning en- It is widely evaluated that creating effective CoT
hance LLMs’ generalization (Sanh et al., 2021; Wei prompts is crucial for unlocking LLMs’ intricate
et al., 2021), highlighting the growing importance reasoning capabilities (Dong et al., 2022). As man-
of LLM-annotated data (Wang et al., 2022c). These ual creation of such prompts can be costly and
methods mainly fall into four categories: time-consuming (Wei et al., 2022b), recent works
In-Context Learning. Originating from the GPT-3 have prevalently proposed to automatically gen-
model (Brown et al., 2020), In-Context Learning erate CoT prompts via LLMs. For example, in
(ICL) has been widely employed to boost the per- Zero-shot CoT (Kojima et al., 2022), LLMs are
formance of LLMs across varied tasks. The ap- prompted with “Let’s think step by step” to gener-
proach often employs specially formatted prompts ate reasoning steps, followed by “Therefore, the an-
that include task instructions along with illustra- swer is” to reach the conclusion. Auto-CoT (Zhang
tive demonstrations (Dong et al., 2022). These et al., 2022) refines this approach by applying a
clustering strategy to the training questions to de- aligning them with human expectations (Zhao et al.,
termine the most representative ones for each clus- 2023). However, in practice, collecting human
ter. A related study (Wang et al., 2022a) extends feedback can be usually expensive and labori-
this by taking into account prompt confidence, find- ous (Ziegler et al., 2019). Therefore, existing works
ing that diverse reasoning paths are essential for typically learn a surrogate reward model that can
effective CoT. In another vein, (Fu et al., 2023) imitate human preference in a pair of inputs (pair-
propose to combine LLM-generated CoT and few- wise feedback). To train a reward model for an-
shot demonstrations to preserve ICL capabilities notations, researchers will generally first collect a
while enhancing the reasoning performance on us- labeled pairwise feedback dataset from human an-
ing different prompt formats. (Wang et al., 2023a) notators. Then based on different strategies, many
explore the use of LLM-annotated rationales for algorithms directly learn from Dl (Keskar et al.,
knowledge distillation based on CoT prompting. 2019; Liu et al., 2023a; Korbak et al., 2023), while
Despite irrelevant or vacuous rationales, authors other algorithms (Christiano et al., 2017; Ouyang
use contrastive decoding to significantly improve et al., 2022) learn a surrogate reward model from Dl
the reasoning abilities of student models trained and use it to automatically annotate unlabeled pair-
with this augmented data. wise feedback generated by LLMs. To align LLMs
Instruction Tuning. While ICL adapts LLMs by with annotations, existing works generally leverage
altering the input structure, instruction tuning takes the strategy of reinforcement learning (OpenAI,
a different approach by fine-tuning models on var- 2023; Touvron et al., 2023b), namely RLHF (rein-
ious tasks in a supervised learning context (Zhao forcement learning from human feedback). As a
et al., 2023). Multiple works have demonstrated classic example, InstructGPT (Ouyang et al., 2022)
that LLMs displayed notable capabilities in gener- utilizes the PPO strategy (Schulman et al., 2017),
alizing to unfamiliar tasks after fine-tuning (Chung and in each update computes the Kullback–Leibler
et al., 2022; Muennighoff et al., 2022). However, (KL) divergence between the current LLM output
the process of obtaining high-quality training data and that from the previous update. In this way, the
for Instruction Tuning generally involves a large framework can be optimized in a more robust man-
amount of human effort, which can be imprac- ner. On the other hand, ILQL (Snell et al., 2022)
tical in specific real-world scenarios (Lou et al., explores the application of alignment tuning on
2023). To avoid the laborious process of acquir- LLM-generated annotations under an offline set-
ing human annotations, recent works have resorted ting in contrast to the prevalent online RL scenario.
to LLM-generated annotations. As a classic ex- In GopherCite (Menick et al., 2022), the authors
ample, in Self-Instruct (Wang et al., 2022b), the employ reinforcement learning from human pref-
LLM is prompted to autonomously generate new erences (RLHP) to train QA models that produce
instructional input-output pairs. These are subse- answers and simultaneously cite specific evidence
quently filtered and used for the fine-tuning of a T5 to support their claims, thereby facilitating the eval-
model (Brown et al., 2020). This two-stage pipeline uation of accuracy. More recently, RLAIF (Lee
generates instructions, filters out invalid or redun- et al., 2023) leverages preferences that are labeled
dant instances, and employs the rest for model fine- by an off-the-shelf LLM in lieu of humans, which
tuning. Alpaca (Taori et al., 2023b) leverages LLM- achieves similar performance with using human-
generated annotations in the form of instruction- labeled data.
following demonstrations to fine-tune a LLaMA
model (Touvron et al., 2023a). Notably, the Go- 6 Challenges
pherCite model (Menick et al., 2022) introduces a
reinforcement learning framework to train LLMs In this section, we outline LLM data annotation
to generate annotations in the form of answers sup- challenges, including technical hurdles, accuracy
ported by cited evidence, thereby enhancing the concerns, and societal implications like labor dis-
verifiability of their responses. (Chiang and Lee, placement and bias propagation. Addressing these
2023) present a study on the reliability of using is vital for advancing LLM annotation applications.
LLM-generated annotations for human-like evalua- Compounding Error in Model Imitation. Efforts
tions across various NLP tasks. to bridge the gap in performance between propri-
Alignment Tuning. Alignment tuning aims to etary LLMs like ChatGPT and their open-source
eliminate the undesirable behaviors of LLms by counterparts, such as LLaMA, typically involve en-
hancing the capabilities of the latter through train- in LLM application domains.
ing with outputs from the more robust models (Sun Social Impact. The proliferation of LLM-
et al., 2023; Gunasekar et al., 2023b; Hsieh et al., generated annotations across real-world sectors
2023; Honovich et al., 2022a; Chiang et al., 2023; such as finance (Yang et al., 2023), jurispru-
Geng et al., 2023). While this strategy has yielded dence (Cui et al., 2023), and healthcare (Eloun-
variable outcomes, imitation models often replicate dou et al., 2023) has the potential to significantly
the stylistic elements without achieving the fac- enhance efficiency and productivity. Yet, this au-
tual precision of the superior models (Gudibande tomation introduces societal challenges, particu-
et al., 2023). Research highlights the failure of larly regarding labor displacement, annotation qual-
imitation primarily due to model collapse, where ity, and societal development implications. The
the imitating model gradually diverges from the shift towards automated annotations risks render-
data distribution of the model it seeks to repli- ing human annotator roles redundant, potentially
cate (Shumailov et al., 2023). This divergence is aggravating income disparities and affecting lower-
fueled by two main issues: Statistical approxima- skilled employment sectors (Dillion et al., 2023).
tion error, stemming from a limited sample size, Moreover, despite the speed of LLM annotation
and Functional approximation error, arising from generation, the absence of human insight may re-
constrained model capacity. Both errors tend to sult in outputs lacking depth, leading to biased or
amplify through successive training cycles (Alemo- unfair research findings (Wu et al., 2023b; Abid
hammad et al., 2023). The repercussions of model et al., 2021; Cheng et al., 2021; Li et al., 2023).
collapse and approximation errors extend into the Furthermore, reliance on LLMs for tasks tradition-
societal realm. Disseminating and utilizing LLM- ally managed by humans necessitates a careful ap-
generated annotations with these inaccuracies in fu- proach to ensure technological progress does not
ture model training can lead to data contamination. inadvertently exacerbate social inequalities or di-
This scenario risks undermining LLMs’ trustwor- minish quality standards. Future studies should
thiness over time, impacting their utility in critical aim to harmonize technological advancements with
applications. Addressing these issues in future re- their broader societal consequences.
search is increasingly crucial for constructing the
next-generation of LLMs, or artificial general intel- 7 Conclusion
ligence (AGI), in broader terms.
Impact of Hallucinations on LLM Annotations. The exploration of LLMs for data annotation has
The phenomenon of hallucinations in LLMs signif- revealed an exciting frontier in NLP, presenting
icantly undermines the integrity and reliability of novel solutions to longstanding challenges like data
their generated annotations (Alkaissi and McFar- scarcity, and enhancing annotation quality and pro-
lane, 2023; Azamfirei et al., 2023). Outputs that cess efficiency. This survey meticulously reviews
are detached from actual data can cause misinfor- methodologies, applications, and hurdles associ-
mation and inaccuracies in annotations, posing sub- ated with LLM employment, including innovative
stantial risks in sensitive areas like healthcare, legal strategies such as prompt engineering and domain-
analysis, and financial domains (Jiang et al., 2023a; specific adjustments. It evaluates the effects of
Chen and Shu, 2023). Addressing hallucinations re- LLM-generated annotations on training machine
quires comprehensive strategies, including refining learning models while addressing both technical
the LLM training process to reduce the emergence and ethical concerns like bias and societal ram-
of unfounded content and implementing validation ifications. Highlighting our novel taxonomy of
mechanisms for annotations through automated and LLM methodologies, strategies for utilizing LLM-
manual verification (Liao and Vaughan, 2023; Pan generated annotations, and a critical discussion
et al., 2023; Bian et al., 2023). However, the inher- on the challenges, this work aims to steer future
ent opacity of LLMs complicates efforts to pinpoint progress in this crucial area. Additionally, we intro-
and rectify the causes of hallucinations, posing eth- duce a comprehensive categorization of techniques
ical dilemmas in deploying LLMs for critical an- and compile extensive benchmark datasets to sup-
notation roles. This emphasizes the necessity of port ongoing research endeavors, concluding with
ongoing research to mitigate hallucinations while an examination of persistent challenges and open
balancing performance gains with ethical concerns questions, paving the way for future investigative
pursuits in the domain.
Limitations Human Oversight. Utilize human oversight to
review LLM-generated annotations, ensuring ac-
Sampling Bias and Hallucination. LLMs can dis- curacy, ethical compliance, and mitigating risks of
play sampling bias, leading to incorrect or “halluci- error propagation or biases.
nated” data, impacting the reliability and quality of Continuous Monitoring for Bias and Error. Reg-
annotations for discriminative tasks. ularly evaluate and update LLMs to identify and
Social Bias and Ethical Dilemmas. The inher- correct for biases, inaccuracies, or ethical concerns,
ent biases in training data can be perpetuated and leveraging diverse datasets and feedback mecha-
amplified by LLMs, leading to ethical concerns nisms to improve model fairness and reliability.
and the propagation of social biases through anno- Social Impact and Responsibility. Consider the
tated data. This is particularly problematic in tasks broader social implications of deploying LLMs for
requiring fairness and impartiality. data annotation, including the potential for job dis-
Dependence on High-Quality Data. LLMs’ use- placement and the ethical use of automated systems
fulness in generating annotations depends on large, in sensitive domains. Aim for socially beneficial
high-quality datasets. But curating these datasets is technologies that enhance human well-being.
labor-intensive, posing a scalability challenge for Collaboration and Engagement. Engage with
LLM-based annotation efforts. a broad spectrum of stakeholders, including ethi-
Complexity in Tuning and Prompt Engineering. cists, domain experts, and affected communities, to
Successfully leveraging LLMs for data annotation gather diverse perspectives and insights, ensuring
requires sophisticated prompt engineering and fine- that LLM applications for data annotation serve the
tuning techniques. This can pose a barrier to entry public interest and ethical standards.
for practitioners and researchers without extensive
expertise in NLP and machine learning.
Generalization and Overfitting While LLMs can References
be powerful tools for annotation, there’s a risk of Abubakar Abid, Maheen Farooqi, and James Zou. 2021.
overfitting to the training data, limiting their ability Persistent anti-muslim bias in large language models.
In Proceedings of the 2021 AAAI/ACM Conference
to generalize to unseen data or different contexts. on AI, Ethics, and Society, pages 298–306.
This is a critical limitation for discriminative tasks
where the goal is to develop models that perform Bernardo Aceituno and Antoni Rosinol. 2022. Stack ai:
The middle-layer of ai.
well across diverse datasets and domains.
Computational and Resource Requirements. Monica Agrawal, Stefan Hegselmann, Hunter Lang,
The training and deployment of state-of-the-art Yoon Kim, and David Sontag. 2022. Large language
models are zero-shot clinical information extractors.
LLMs for data annotation require substantial com-
arXiv preprint arXiv:2205.12689.
putational resources, which may not be accessible
to all researchers and organizations, thereby limit- Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo
ing widespread adoption. Luzi, Ahmed Imtiaz Humayun, Hossein Reza Babaei,
Daniel LeJeune, Ali Siahkoohi, and Richard Bara-
niuk. 2023. Self-consuming generative models go
Ethics Statement mad. ArXiv, abs/2307.01850.

Commitment to Fairness. Ensure the develop- Meysam Alizadeh, Maël Kubli, Zeynab Samei,
Shirin Dehghani, Juan Diego Bermeo, Maria Ko-
ment and application of LLMs for data annotation
robeynikova, and Fabrizio Gilardi. 2023. Open-
adheres to ethical principles that promote fairness source large language models outperform crowd
and prevent bias, recognizing the diversity of data workers and approach chatgpt in text-annotation
and avoiding discriminatory outcomes. tasks. arXiv preprint arXiv:2307.02179.
Transparency and Accountability. Maintain Hussam Alkaissi and Samy I McFarlane. 2023. Artifi-
transparency in LLM methodologies, training data, cial hallucinations in chatgpt: implications in scien-
and annotation processes. Provide clear documen- tific writing. Cureus, 15(2).
tation and accountability mechanisms to address Walid Amamou. 2021. Ubiai: Text annotation tool.
potential errors or biases introduced by LLMs.
Yuvanesh Anand, Zach Nussbaum, Brandon Duder-
Privacy and Data Protection. Maintain robust stadt, Benjamin Schmidt, and Andriy Mulyar. 2023.
data privacy protocols, ensuring confidentiality and Gpt4all: Training an assistant-style chatbot with large
consent in training and annotation datasets. scale data distillation from gpt-3.5-turbo. GitHub.
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Lu Cheng, Kush R Varshney, and Huan Liu. 2021. So-
Deep Ganguli, Tom Henighan, Andy Jones, Nicholas cially responsible ai algorithms: Issues, purposes,
Joseph, Ben Mann, Nova DasSarma, et al. 2021. A and challenges. Journal of Artificial Intelligence Re-
general language assistant as a laboratory for align- search, 71:1137–1181.
ment. arXiv preprint arXiv:2112.00861.
Cheng-Han Chiang and Hung-yi Lee. 2023. Can large
Razvan Azamfirei, Sapna R Kudchadkar, and James language models be an alternative to human evalua-
Fackler. 2023. Large language models and the perils tions? arXiv preprint arXiv:2305.01937.
of their hallucinations. Critical Care, 27(1):1–2.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng,
Michiel Bakker, Martin Chadwick, Hannah Sheahan, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan
Michael Tessler, Lucy Campbell-Gillingham, Jan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion
Balaguer, Nat McAleese, Amelia Glaese, John Stoica, and Eric P. Xing. 2023. Vicuna: An open-
Aslanides, Matt Botvinick, et al. 2022. Fine-tuning source chatbot impressing GPT-4 with 90%* chatgpt
language models to find agreement among humans quality.
with diverse preferences. Advances in Neural Infor-
Paul F Christiano, Jan Leike, Tom Brown, Miljan Mar-
mation Processing Systems, 35:38176–38189.
tic, Shane Legg, and Dario Amodei. 2017. Deep
reinforcement learning from human preferences. Ad-
Parikshit Bansal and Amit Sharma. 2023. Large lan- vances in neural information processing systems, 30.
guage models as annotators: Enhancing generaliza-
tion of nlp models at minimal cost. arXiv preprint Hyung Won Chung, Le Hou, Shayne Longpre, Bar-
arXiv:2306.15766. ret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, et al.
Ning Bian, Peilin Liu, Xianpei Han, Hongyu Lin, Yao- 2022. Scaling instruction-finetuned language models.
jie Lu, Ben He, and Le Sun. 2023. A drop of ink arXiv preprint arXiv:2210.11416.
may make a million think: The spread of false in-
formation in large language models. arXiv preprint Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and
arXiv:2305.04812. Li Yuan. 2023. Chatlaw: Open-source legal large
language model with integrated external knowledge
Samuel R Bowman, Gabor Angeli, Christopher Potts, bases. arXiv preprint arXiv:2306.16092.
and Christopher D Manning. 2015. A large annotated
corpus for learning natural language inference. arXiv Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming
preprint arXiv:1508.05326. Ma, Zhifang Sui, and Furu Wei. 2023. Why can gpt
learn in-context? language models secretly perform
Tom Brown, Benjamin Mann, Nick Ryder, Melanie gradient descent as meta-optimizers. In Findings of
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind the Association for Computational Linguistics: ACL
Neelakantan, Pranav Shyam, Girish Sastry, Amanda 2023, pages 4005–4019.
Askell, et al. 2020. Language models are few-shot
learners. Advances in neural information processing Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
systems, 33:1877–1901. Kristina Toutanova. 2018. Bert: Pre-training of deep
bidirectional transformers for language understand-
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, ing. arXiv preprint arXiv:1810.04805.
Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi,
Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Danica Dillion, Niket Tandon, Yuling Gu, and Kurt
Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. Gray. 2023. Can ai language models replace human
2023. A survey on evaluation of large language mod- participants? Trends in Cognitive Sciences.
els.
Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan,
Shizhe Diao, Jipeng Zhang, Kashun Shum, and
Canyu Chen and Kai Shu. 2023. Can llm-generated Tong Zhang. 2023. Raft: Reward ranked finetuning
misinformation be detected? arXiv preprint for generative foundation model alignment. arXiv
arXiv:2309.13788. preprint arXiv:2304.06767.
Mingda Chen, Jingfei Du, Ramakanth Pasunuru, Todor Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiy-
Mihaylov, Srini Iyer, Veselin Stoyanov, and Zor- ong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and
nitsa Kozareva. 2022. Improving in-context few-shot Zhifang Sui. 2022. A survey for in-context learning.
learning via self-supervised training. In NAACL. arXiv preprint arXiv:2301.00234.

Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang,
Sabharwal, and Kyle Richardson. 2023. Disco: Dis- Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy
tilling counterfactuals with large language models. Liang, and Tatsunori B Hashimoto. 2023. Al-
In Proceedings of the 61st Annual Meeting of the pacafarm: A simulation framework for methods
Association for Computational Linguistics (Volume that learn from human feedback. arXiv preprint
1: Long Papers), pages 5514–5528. arXiv:2305.14387.
Avia Efrat and Omer Levy. 2020. The turking test: Can Xingwei He, Zhenghao Lin, Yeyun Gong, Alex Jin,
language models understand instructions? arXiv Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan
preprint arXiv:2010.11982. Duan, Weizhu Chen, et al. 2023. Annollm: Making
large language models to be better crowdsourced
Tyna Eloundou, Sam Manning, Pamela Mishkin, and annotators. arXiv preprint arXiv:2303.16854.
Daniel Rock. 2023. Gpts are gpts: An early look at
the labor market impact potential of large language SU Hongjin, Jungo Kasai, Chen Henry Wu, Weijia Shi,
models. arXiv preprint arXiv:2303.10130. Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf,
Luke Zettlemoyer, Noah A Smith, et al. 2022. Selec-
Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and tive annotation makes language models better few-
Tushar Khot. 2023. Specializing smaller language shot learners. In ICLR.
models towards multi-step reasoning. arXiv preprint
arXiv:2301.12726. Matthew Honnibal and Ines Montani. 2017. spaCy 2:
Natural language understanding with Bloom embed-
Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wal- dings, convolutional neural networks and incremental
lace, Pieter Abbeel, Sergey Levine, and Dawn Song. parsing. To appear.
2023. Koala: A dialogue model for academic re-
search. BAIR Blog. Or Honovich, Thomas Scialom, Omer Levy, and Timo
Schick. 2022a. Unnatural instructions: Tuning lan-
Amelia Glaese, Nat McAleese, Maja Tr˛ebacz, John guage models with (almost) no human labor. arXiv
Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, preprint arXiv:2212.09689.
Laura Weidinger, Martin Chadwick, Phoebe Thacker,
et al. 2022. Improving alignment of dialogue agents Or Honovich, Uri Shaham, Samuel R Bowman, and
via targeted human judgements. arXiv preprint Omer Levy. 2022b. Instruction induction: From few
arXiv:2209.14375. examples to natural language task descriptions. arXiv
preprint arXiv:2205.10782.
Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami,
Bailan He, Gengyuan Zhang, Ruotong Liao, Yao
Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu,
Qin, Volker Tresp, and Philip Torr. 2023a. A sys-
Ruobing Xie, Julian McAuley, and Wayne Xin
tematic survey of prompt engineering on vision- Zhao. 2023. Large language models are zero-shot
language foundation models. arXiv preprint rankers for recommender systems. arXiv preprint
arXiv:2307.12980. arXiv:2305.08845.
Yu Gu, Sheng Zhang, Naoto Usuyama, Yonas Wold-
esenbet, Cliff Wong, Praneeth Sanapathi, Mu Wei, Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh,
Naveen Valluri, Erika Strandberg, Tristan Naumann, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner,
et al. 2023b. Distilling large language models for Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister.
biomedical knowledge extraction: A case study on ad- 2023. Distilling step-by-step! outperforming larger
language models with less training data and smaller
verse drug events. arXiv preprint arXiv:2307.06439.
model sizes. arXiv preprint arXiv:2305.02301.
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang.
2023c. Knowledge distillation of large language Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell,
models. arXiv preprint arXiv:2306.08543. and Iz Beltagy. 2023. Large language model dis-
tillation doesn’t need a teacher. arXiv preprint
Arnav Gudibande, Eric Wallace, Charles Burton Snell, arXiv:2305.14864.
Xinyang Geng, Hao Liu, P. Abbeel, Sergey Levine,
and Dawn Song. 2023. The false promise of imitating Bohan Jiang, Zhen Tan, Ayushi Nirmal, and Huan
proprietary llms. ArXiv, abs/2305.15717. Liu. 2023a. Disinformation detection: An evolv-
ing challenge in the age of llms. arXiv preprint
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio arXiv:2309.15847.
César Teodoro Mendes, Allie Del Giorno, Sivakanth
Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei
de Rosa, Olli Saarikivi, et al. 2023a. Textbooks are Wang. 2023b. Lion: Adversarial distillation of
all you need. arXiv preprint arXiv:2306.11644. closed-source large language model. arXiv preprint
arXiv:2305.12870.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Ce-
sar Teodoro Mendes, Allison Del Giorno, Sivakanth Nitish Shirish Keskar, Bryan McCann, Lav R Varshney,
Gopi, Mojan Javaheripi, Piero C. Kauffmann, Gus- Caiming Xiong, and Richard Socher. 2019. Ctrl: A
tavo de Rosa, Olli Saarikivi, Adil Salim, S. Shah, conditional transformer language model for control-
Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, lable generation. arXiv preprint arXiv:1909.05858.
Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and
Yuan-Fang Li. 2023b. Textbooks are all you need. Emre Kıcıman, Robert Ness, Amit Sharma, and Chen-
ArXiv, abs/2306.11644. hao Tan. 2023. Causal reasoning and large language
models: Opening a new frontier for causality. arXiv
Chase Harrison. 2022. Langchain. preprint arXiv:2305.00050.
Jaehyung Kim, Jinwoo Shin, and Dongyeop Kang. large language models’ alignment. arXiv preprint
2023. Prefer to classify: Improving text classifiers arXiv:2308.05374.
via auxiliary preference learning. arXiv preprint
arXiv:2306.04925. Renze Lou, Kai Zhang, and Wenpeng Yin. 2023. Is
prompt all you need? no. a comprehensive and
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- broader view of instruction learning. arXiv preprint
taka Matsuo, and Yusuke Iwasawa. 2022. Large lan- arXiv:2303.10475.
guage models are zero-shot reasoners. Advances in
neural information processing systems, 35:22199– Lucie Charlotte Magister, Jonathan Mallinson, Jakub
22213.
Adamek, Eric Malmi, and Aliaksei Severyn. 2022. Teaching small language models to reason. arXiv preprint arXiv:2212.08410.
Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. 2023. Pretraining language models with human preferences. In International Conference on Machine Learning, pages 17506–17533. PMLR.
Daniil Larionov, Artem Shelmanov, Elena Chistova, and Ivan Smirnov. 2019. Semantic role labeling with pre-trained language models for known and unknown predicates. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 619–628.
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback.
Dawei Li, Zhen Tan, Tianlong Chen, and Huan Liu. 2024. Contextualization distillation from large language model for knowledge graph completion. arXiv preprint arXiv:2402.01729.
Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. 2023. A survey on fairness in large language models. arXiv preprint arXiv:2308.10149.
Q Vera Liao and Jennifer Wortman Vaughan. 2023. Ai transparency in the age of llms: A human-centered research roadmap. arXiv preprint arXiv:2306.01941.
Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.
Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. 2023a. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023c. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023d. Trustworthy llms: a survey and guideline for evaluating large language models' alignment.
Katerina Margatina, Timo Schick, Nikolaos Aletras, and Jane Dwivedi-Yu. 2023. Active learning principles for in-context learning with large language models. arXiv preprint arXiv:2305.14264.
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. 2022. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147.
Sachit Menon and Carl Vondrick. 2022. Visual classification via description from large language models. arXiv preprint arXiv:2210.07183.
Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2021. A diverse corpus for evaluating and developing english math word problem solvers. arXiv preprint arXiv:2106.15772.
Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. 2021. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys.
Ines Montani and Matthew Honnibal. 2018. Prodigy: A new annotation tool for radically efficient machine teaching. Artificial Intelligence, to appear.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474.
OpenAI. 2023. Gpt-4 technical report.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. 2023. On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661.
Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618.
A Shelmanov, D Puzyrev, L Kupriyanova, N Khromov, DV Dylov, A Panchenko, D Belyakov, D Larionov, E Artemova, and O Kozlova. 2021. Active learning for sequence tagging with deep pre-trained models and bayesian uncertainty estimates. In EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, pages 1698–1712.
Richard Shin, Christopher Lin, Sam Thomson, Charles Chen Jr, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7699–7715.
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. The curse of recursion: Training on generated data makes models forget. ArXiv, abs/2305.17493.
Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, and Sergey Levine. 2022. Offline rl for natural language generation with implicit language q learning. arXiv preprint arXiv:2206.11871.
Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. 2023. Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492.
Taylor Sorensen, Joshua Robinson, Christopher Michael Rytting, Alexander Glenn Shaw, Kyle Jeffrey Rogers, Alexia Pauline Delorey, Mahmoud Khalil, Nancy Fulda, and David Wingate. 2022. An information-theoretic approach to prompt engineering without ground truth labels. arXiv preprint arXiv:2203.11364.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
Weiwei Sun, Lingyong Yan, Xinyu Ma, Pengjie Ren, Dawei Yin, and Zhaochun Ren. 2023. Is chatgpt good at search? investigating large language models as re-ranking agent. arXiv preprint arXiv:2304.09542.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.
Alex Tamkin, Dat Nguyen, Salil Deshpande, Jesse Mu, and Noah Goodman. 2022. Active learning helps pretrained models learn the intended task. Advances in Neural Information Processing Systems, 35:28140–28153.
Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, et al. 2023a. Gkd: A general knowledge distillation framework for large-scale pre-trained language model. arXiv preprint arXiv:2306.06629.
Zhen Tan, Tianlong Chen, Zhenyu Zhang, and Huan Liu. 2023b. Sparsity-guided holistic explanation for llms with interpretable inference-time intervention. arXiv preprint arXiv:2312.15033.
Zhen Tan, Lu Cheng, Song Wang, Yuan Bo, Jundong Li, and Huan Liu. 2023c. Interpreting pretrained language models via concept bottlenecks. arXiv preprint arXiv:2311.05014.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023a. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023b. Stanford alpaca: An instruction-following llama model.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine, pages 1–11.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. Newsqa: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.
Tamás Vörös, Sean Paul Bergeron, and Konstantin Berlin. 2023. Web content filtering through knowledge distillation of large language models. arXiv preprint arXiv:2305.05027.
Somin Wadhwa, Silvio Amir, and Byron C Wallace. 2023. Revisiting relation extraction in the era of large language models. arXiv preprint arXiv:2305.05003.
Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, and Xiang Ren. 2023a. Scott: Self-consistent chain-of-thought distillation. arXiv preprint arXiv:2305.01879.
Song Wang, Zhen Tan, Ruocheng Guo, and Jundong Li. 2023b. Noise-robust fine-tuning of pretrained language models via external guidance. arXiv preprint arXiv:2311.01108.
Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, et al. 2023c. Knowledge editing for large language models: A survey. arXiv preprint arXiv:2310.16218.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022a. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022b. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022c. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705.
Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023d. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Huggingface's transformers: State-of-the-art natural language processing.
Sherry Wu, Hua Shen, Daniel S Weld, Jeffrey Heer, and Marco Tulio Ribeiro. 2023a. Scattershot: Interactive in-context example curation for text transformation. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 353–367.
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023b. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564.
Canwen Xu, Yichong Xu, Shuohang Wang, Yang Liu, Chenguang Zhu, and Julian McAuley. 2023. Small models are valuable plug-ins for large language models. arXiv preprint arXiv:2305.08848.
Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. Fingpt: Open-source financial large language models. arXiv preprint arXiv:2306.06031.
Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, and Lingpeng Kong. 2022. Zerogen: Efficient zero-shot learning via dataset generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11653–11669.
Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2022. Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063.
Xinli Yu, Zheng Chen, Yuan Ling, Shujing Dong, Zongyi Liu, and Yanbin Lu. 2023. Temporal data meets llm–explainable financial time series forecasting. arXiv preprint arXiv:2306.11025.
Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2023. Large language models meet nl2code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7443–7464.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493.
Mengjie Zhao, Fei Mi, Yasheng Wang, Minglei Li, Xin Jiang, Qun Liu, and Hinrich Schütze. 2021. Lmturk: Few-shot learners as crowdsourcing workers in a language-model-as-a-service framework. arXiv preprint arXiv:2112.07522.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models.
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

A LLM-assisted Tools and Software for Annotation

LLM-assisted annotation tools and software are invaluable resources designed specifically to facilitate the annotation process for various NLP tasks. One of their primary attributes is an intuitive and user-friendly interface, allowing engineers and even non-technical annotators to easily work with complex textual data. These tools are built to support numerous annotation types, from simple binary labels to more intricate hierarchical structures. Their main goal is to simplify the labeling process, enhance the quality of the labels, and boost overall productivity in data annotation. Below, we present a selection of libraries and tools that support Large Language Models in the annotation process:

• LangChain: LangChain (Harrison, 2022) is an open-source library (as of now, available only in JavaScript/TypeScript and Python) that offers an array of tools designed to facilitate the construction of LLM-related pipelines and workflows. The library equips large language models with agents that interact with their environment as well as with various external data sources, enabling dynamic and contextually appropriate responses that go beyond a single LLM call. For annotation, its power lies mostly in facilitating the process through a modularized structure called a chain: a complex problem is broken down into smaller sub-tasks, and the results obtained from one or more steps are then aggregated and used as input prompts for subsequent actions in the chain (a minimal, library-agnostic sketch of this pattern appears after this list).

• Stack AI: Stack AI (Aceituno and Rosinol, 2022) is a paid service that offers an AI-powered data platform designed explicitly for automating business processes in order to maximize efficiency. The essence of the platform is the ability to visually design, test, and deploy AI workflows with smooth integration of Large Language Models. Its user-friendly graphical interface (Figure 2) allows users to create apps and workflows for diverse tasks, from content creation and data labeling to conversational AI apps and document processing. Moreover, Stack AI utilizes weakly supervised machine learning models to expedite the data preparation process.
Figure 2: Stack AI dashboard. The platform provides a visual interface for users to design and track the AI workflow.

• UBIAI: UBIAI (Amamou, 2021) is a paid annotation tool that offers multilingual, cloud-based solutions and services in Natural Language Processing. The company aims to help users extract valuable insights from unstructured documents. The tool not only provides a user interface that facilitates manual labeling but also offers several auto-labeling functionalities, such as LLM-assisted zero- and few-shot labeling and model-assisted labeling. It also provides integration with various models on huggingface (Wolf et al., 2020) as well as an environment to fine-tune different models on the user's labeled data.

Figure 3: UBIAI annotation result on a PDF document. All the entities in the text of the document have been identified, annotated, and color-coded based on their type. This image has been borrowed from the videos provided in the UBIAI documentation (Amamou, 2021).

• Prodigy: Prodigy (Montani and Honnibal, 2018), designed by the creators of the spaCy library (Honnibal and Montani, 2017), offers rule-based, statistical, and LLM-assisted methods for annotation. The tool provides easy, flexible, and powerful annotation options such as named entity recognition, span categorization, and classification/labeling for different modalities, including text, audio, and vision. Moreover, it can be easily integrated with large language models capable of zero- or few-shot learning, while also offering services and quantifiable methods for crafting prompts to address any noisy outcomes. This tool is not open-source.
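To make the chaining pattern mentioned in the LangChain entry concrete, the following is a minimal, library-agnostic sketch of a two-step annotation chain: the first step performs zero-shot entity labeling, and the second step folds that intermediate output into the prompt of a follow-up topic-labeling step. The call_llm helper and the prompt wording are hypothetical placeholders, not the API of any particular tool; frameworks such as LangChain wrap the same idea in reusable chain objects.

# Sketch of a two-step annotation "chain": the output of step 1 is aggregated
# into the prompt of step 2 (call_llm is a hypothetical stand-in for any LLM client).

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to an LLM and return its text response."""
    raise NotImplementedError("Plug in a concrete LLM client here.")


def annotate_with_chain(document: str) -> dict:
    # Step 1: zero-shot entity annotation.
    entity_prompt = (
        "List every named entity in the text below as 'entity: type', "
        "one per line.\n\nText:\n" + document
    )
    entities = call_llm(entity_prompt)

    # Step 2: condition the final label on the entities extracted in step 1.
    label_prompt = (
        "Given the text and the entities extracted from it, assign a single "
        "topic label and justify it in one sentence.\n\n"
        f"Text:\n{document}\n\nEntities:\n{entities}"
    )
    label = call_llm(label_prompt)

    return {"entities": entities, "topic_label": label}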
B Collections of Papers on LLM for Data Annotation

This collection of tables provides a concise overview of using Large Language Models (LLMs) for data annotation, including state-of-the-art techniques, methodologies, and practical applications. Table 2 lists significant papers on LLM-based data annotation, detailing their methods, core technologies, publication venues, and links to resources. Table 3 focuses on evaluating the quality of LLM-generated annotations. Tables 4, 5, and 6 explore strategies for learning with LLM-generated annotations, covering domain-specific inference, knowledge distillation, fine-tuning, and prompting techniques such as In-Context Learning and Instruction Tuning. Each table clearly outlines the scenarios, techniques, technologies, venues, and available resources, serving as a guide to the latest in LLM-driven data annotation and its implications for the future of automated data processing and machine learning research.
Paper Scenario Technique Backbone Venue Code/Data Link
Manually Engineered Prompts
RAFT: Reward rAnked FineTuning for
Generative Foundation Model Alignment[1] Unsupervised Zero-Shot LLaMA Arxiv’23 Link
ZeroGen: Efficient Zero-shot Learning
via Dataset Generation[2] Unsupervised Zero-Shot GPT-2 EMNLP’22 Link
BART
Constrained Language Models Yield Alignment Tuning GPT-2
Few-Shot Semantic Parsers[3] Supervised Few-Shot GPT-3 EMNLP’21 Link
Learning To Retrieve Prompts for
In-Context Learning[4] Unsupervised Few-Shot BERT NAACL-HLT’22 Link
Small Models are Valuable Plug-ins RoBERTa
for Large Language Models[5] Supervised Few-Shot XLM-V Arxiv’23 Link
Alignment via Pairwise Feedback
Why can GPT learn in-context?
language models secretly perform GPT 1.3B
gradient descent as meta-optimizers[6] Supervised Human Feedback GPT 2.7B ACL’23 Link
Fine-Tuning Language Models
from Human Preferences[7] Unsupervised Human Feedback GPT-2 Arxiv’19 Link
Fine-tuning language models to find
Zero-Shot
agreement among humans with Few-Shot
diverse preferences [8] Unsupervised Human Feedback Chinchilla NeurIPS’22 Not Available
Teaching language models to support
answers with verified quotes[9] Unsupervised Automated Feedback Gopher Arxiv’22 Link
Learning to summarize with Zero-Shot
human feedback[10] Supervised Automated Feedback GPT-3 NeurIPS’20 Link

[1]
Note: (Dong et al., 2023); [2] (Ye et al., 2022); [3] (Shin et al., 2021); [4] (Rubin et al., 2022); [5] (Xu et al., 2023); [6] (Dai et al.,
[7]
2023); (Ziegler et al., 2019); [8] (Bakker et al., 2022); [9] (Menick et al., 2022); [10] (Stiennon et al., 2020); .

Table 2: A list of representative LLM-Based Data Annotation papers with open-source code/data.
Paper Scenario Technique Backbone Venue Code/Data Link
Evaluation
The Turking Test: Can Language
Models Understand Instructions?[1] Supervised Human Centric GPT-2 Arxiv'20 Not Available
Unnatural Instructions: Tuning Language
Models with (Almost) No Human Labor[2] Unsupervised Human Centric T5 Arxiv'22 Link
Open-Source Large Language
Models Outperform Crowd Workers and Automatic
Approach ChatGPT in Text-Annotation Tasks[3] Unsupervised Human Centric ChatGPT Arxiv'23 Not Available
Data Selection Via Active Learning
Active Learning for Sequence BiLSTM
BERT
Tagging with Deep Pre-trained Models Distill-BERT
and Bayesian Uncertainty Estimates[4] Semi-Supervised In-Context Learning ELECTRA EACL’21 Not Available
Active learning helps pretrained BiT
models learn the intended task[5] Semi-Supervised In-Context Learning Roberta Arxiv’22 Link
Active Learning Principles for In-Context GPT
Learning with Large Language Models[6] Supervised In-Context Learning OPT EMNLP'23 Not Available
Large Language Models as Annotators:
Enhancing Generalization of NLP
Models at Minimal Cost[7] Semi-Supervised In-Context Learning GPT-3.5 turbo Arxiv’23 Not Available
ScatterShot: Interactive In-context
Example Curation for Text
Transformation[8] Unsupervised In-Context Learning GPT-3 IUI’23 Link
Prefer to Classify: Improving Text
Classifiers via Auxiliary Preference
Learning[9] Supervised In-Context Learning GPT-3 ICML’23 Link

[1]
Note: (Efrat and Levy, 2020); [2] (Honovich et al., 2022a); [3] (Alizadeh et al., 2023); [4] (Shelmanov et al., 2021); [5] (Tamkin
et al., 2022); [6] (Margatina et al., 2023); [7] (Bansal and Sharma, 2023); [8] (Wu et al., 2023a); [9] (Kim et al., 2023);.

Table 3: A list of representative Assessing LLM-Generated Annotations papers with open-source code/data.
Paper Scenario Technique Backbone Venue Code/Data Link
Target Domain Inference
An Information-theoretic Approach GPT2
GPT3
to Prompt Engineering Without GPT-Neo
Ground Truth Labels[1] Unsupervised Predicting Labels GPT-J ACL’22 Link
GPT-3
PaLM
FLAN
Emergent Abilities of Large LaMDA
Language Models[2] Unsupervised Predicting Labels Chinchilla TMLR’22 Not Available
GPT3
PaLM
GPT-Neo
Large Language Models are GPT-J
Zero-Shot Reasoners[3] Unsupervised Predicting Labels OPT NeurIPS’22 Link
ELMo
Language Models as Knowledge Bases?[4] Unsupervised Predicting Labels BERT EMNLP’19 Link
Causal Reasoning and Large Language Models: GPT3.5
Opening a New Frontier for Causality[5] Unsupervised Predicting Labels GPT4 Arxiv'23 Not Available
Alpaca
Vicuna
LLama2
Large Language Models are Zero-Shot GPT3.5
Rankers for Recommender Systems[6] Unsupervised Predicting Labels GPT4 ECIR’24 Link
Learning Transferable Visual Models
From Natural Language Supervision[7] Unsupervised Inferring Additional Attributes Transformer PMLR’21 Link
Visual Classification via Description
from Large Language Models[8] Unsupervised Inferring Additional Attributes GPT3 Arxiv'22 Not Available
Knowledge Distillation
PaLM
Teaching Small Language GPT-3
Models to Reason[9] Unsupervised Chain-of-Thought T5 Arxiv’22 Not Available
Specializing Smaller Language Models GPT-3.5
towards Multi-Step Reasoning [10] Unsupervised Chain-of-Thought T5 Arxiv’23 Not Available
Is ChatGPT Good at Search?
Investigating Large Language ChatGPT
Models as Re-Ranking Agents[11] Unsupervised Chain-of-Thought GPT-4 EMNLP’23 Not Available
Distilling Step-by-Step! Outperforming
Larger Language Models with Less PaLM
Training Data and Smaller Model Sizes[12] Semi-Supervised Chain-of-Thought T5 ACL’23 Link
GPT4All: Training an Assistant-style
GPT-3.5-Turbo
Chatbot with Large Scale Data LLaMA
Distillation from GPT-3.5-Turbo[13] Unsupervised Input-Output Prompting LoRA GitHub’23 Link
GKD: A General Knowledge Distillation Unsupervised
Framework for Large-scale Pre-trained Semi-supervised BERT
Language Model[14] Supervised Input-Output Prompt GLM ACL’23 Link
Lion: Adversarial Distillation of Instruction Tuning ChatGPT
Proprietary Large Language Models[15] Unsupervised Chain-of-Thought GPT-4 EMNLP’23 Link
GPT2
Knowledge Distillation of OPT
LLama
Large Language Models[16] Supervised Instruction Tuning GPT-J Arxiv’23 Link
Distilling Large Language Models
for Biomedical Knowledge Extraction: GPT3.5
A Case Study on Adverse Drug Events[17] Supervised Instruction Tuning GPT4 Arxiv’23 Not Available
Web Content Filtering through knowledge T5
distillation of Large Language Models[18] Supervised Input-Output Prompt GPT3 Arxiv’23 Not Available

[1]
Note: (Sorensen et al., 2022); [2] (Wei et al., 2022a); [3] (Kojima et al., 2022); [4] (Petroni et al., 2019); [5] (Kıcıman et al., 2023);
[6]
(Hou et al., 2023); [7] (Radford et al., 2021); [8] (Menon and Vondrick, 2022); [9] (Magister et al., 2022); [10] (Fu et al., 2023);
[11]
(Sun et al., 2023) [12] (Hsieh et al., 2023); [13] (Anand et al., 2023); [14] (Tan et al., 2023a); [15] (Jiang et al., 2023b); [16] (Gu
et al., 2023c); [17] (Gu et al., 2023b); [18] (Vörös et al., 2023); .

Table 4: A list of representative Learning with LLM-Generated Annotations papers for Target Domain Inference
and Knowledge Distillation with open-source code/data.
Paper Scenario Technique Backbone Venue Code/Data Link
Fine-Tuning and Prompting - In-Context Learning
Language Models are Few-Shot Learners[1] Supervised In-Context Learning GPT-3 NeurIPS’20 Not Available
Active Learning Principles for In-Context GPT
Learning with Large Language Models[2] Supervised In-Context Learning OPT EMNLP’23 Not Available
Selective Annotation Makes Language GPT-J
Models Better Few-Shot Learners[3] Supervised In-Context Learning Codex-davinci-002 Arxiv'22 Link
Instruction Induction: From Few Examples GPT-3
to Natural Language Task Descriptions[4] Unsupervised In-Context Learning InstructGPT Arxiv’22 Link
Synthetic Prompting: Generating
Chain-of-Thought Demonstrations
for Large Language Models [5] Unsupervised In-Context Learning InstructGPT ICML’23 Not Available
Improving In-Context Few-Shot
Learning via Self-Supervised Training [6] Supervised In-Context Learning RoBERTa NAACL’22 Not Available
Fine-Tuning and Prompting - Chain-of-Thought Prompting
A Diverse Corpus for Evaluating LCA++
and Developing English Math UnitDep
Word Problem Solvers[7] Supervised Chain-of-Thought GTS ACL’20 Link
GPT-3
LaMDA
PaLM
Chain-of-Thought Prompting Elicits UL2 20B
Reasoning in Large Language Models[8] Supervised Chain-of-Thought Codex NeurIPS’22 Not Available
Instruct-GPT3
GPT-2
GPT-Neo
GPT-J
Large Language Models are T0
Zero-Shot Reasoners[9] Unsupervised Chain-of-Thought OPT NeurIPS’22 Not Available
Automatic chain of thought prompting Supervised GPT-3
in large language models[10] Unsupervised Chain-of-Thought Codex ICLR’23 Link
Rationale-augmented ensembles in PaLM
language models[11] Semi-Supervised Chain-of-Thought GPT-3 Arxiv’22 Not Available
Specializing Smaller Language Models GPT-3.5
towards Multi-Step Reasoning [12] Unsupervised Chain-of-Thought T5 Arxiv’23 Not Available
SCOTT: Self-Consistent Chain-of-Thought GPT-neox
Distillation[13] Supervised Chain-of-Thought T5 Arxiv’22 Not Available

[1]
Note: (Brown et al., 2020); [2] (Margatina et al., 2023); [3] (Hongjin et al., 2022); [4] (Honovich et al., 2022b); [5] (Shao et al.,
[6] [7] [8] [9] [10]
2023); (Chen et al., 2022); (Miao et al., 2021); (Wei et al., 2022b); (Kojima et al., 2022); (Zhang et al., 2022);
[11] [12] [13]
(Wang et al., 2022a); (Fu et al., 2023); (Wang et al., 2023a); .

Table 5: A list of representative Learning with LLM-Generated Annotations papers for Fine-Tuning and Prompting
(In-Context Learning and Chain-of-Thought) with open-source code/data.
Paper Scenario Technique Backbone Venue Code/Data Link
Fine-Tuning and Prompting - Instruction Tuning

Scaling Instruction-finetuned Language T5


PaLM
Models[1] Unsupervised Instruction Tuning U-PaLM Arxiv’22 Link
Crosslingual Generalization through BLOOM
Multitask Finetuning[2] Supervised Instruction Tuning T5 ACL’23 Link
Self-Instruct: Aligning Language
Models with Self-Generated Instructions[3] Supervised Instruction Tuning GPT-3 ACL’23 Link
Language Models are Few-Shot Learners[4] Supervised Instruction Tuning GPT-3 NeurIPS’20 Not Available
LLaMA: Open and Efficient Foundation
Language Models[5] Unsupervised Instruction Tuning LLaMA Arxiv’23 Link
Can Large Language Models Be
an Alternative to Human Evaluations? [6] Unsupervised Instruction Tuning GPT-2 ACL’23 Not Available
Super-NaturalInstructions: Generalization via GPT-3
Declarative Instructions on 1600+ NLP Tasks[7] Supervised Instruction Tuning T5 EMNLP’22 Link
Fine-Tuning and Prompting - Alignment Tuning
Fine-Tuning Language Models
from Human Preferences[8] Supervised Alignment Tuning GPT-2 Arxiv’19 Link
CTRL: A Conditional Transformer Language
Model for Controllable Generation [9] Supervised Alignment Tuning CTRL Arxiv’19 Link
Chain of hindsight aligns language GPT-J
models with feedback[10] Supervised Alignment Tuning OPT Arxiv’23 Link
Pretraining Language Models with
Human Preferences[11] Supervised Alignment Tuning GPT-2 PMLR’23 Link
Training language models to follow
instructions with human feedback[12] Supervised Alignment Tuning GPT-3 NeurIPS’22 Not Available
Llama 2: Open Foundation and
Fine-Tuned Chat Models[13] Supervised Alignment Tuning Llama 1 Arxiv’23 Link
Offline RL for Natural Language
Generation with Implicit Language
Q Learning[14] Supervised Alignment Tuning GPT-2 ICLR’23 Link
Teaching language models to support
answers with verified quotes[15] Supervised Alignment Tuning Gopher Arxiv’22 Link
RLAIF: Scaling Reinforcement Learning
from Human Feedback with AI Feedback[16] Supervised Alignment Tuning PaLM 2 Arxiv’23 Not Available

[1]
Note: (Chung et al., 2022); [2] (Muennighoff et al., 2022); [3] (Wang et al., 2022b); [4] (Brown et al., 2020); [5] (Touvron et al.,
[6] [7] [8] [9] [10]
2023a); (Chiang and Lee, 2023); (Wang et al., 2022c); (Ziegler et al., 2019); (Keskar et al., 2019); (Liu et al.,
[11] [12] [13] [14] [15]
2023a); (Korbak et al., 2023); (Ouyang et al., 2022); (Touvron et al., 2023b); (Snell et al., 2022); (Menick et al.,
2022); [16] (Lee et al., 2023) .

Table 6: A list of representative Learning with LLM-Generated Annotations papers for Fine-Tuning and Prompting
(Instruction Tuning and Alignment Tuning) with open-source code/data.
