Large Language Models For Data Annotation - A Survey
Zhen Tan♠∗ Alimohammad Beigi♠∗ Song Wang♣ Ruocheng Guo♦ Amrita Bhattacharjee♠
Bohan Jiang♠ Mansooreh Karami♠ Jundong Li♣ Lu Cheng♥ Huan Liu♠
♠School of Computing and Augmented Intelligence, Arizona State University
♣Department of Electrical and Computer Engineering, University of Virginia
♦ByteDance Research ♥Department of Computer Science, University of Illinois Chicago
{ztan36,abeigi,abhatt43,bjiang14,mkarami,huanliu}@asu.edu
{sw3wv,jundong}@virginia.edu
[email protected], [email protected]
∗ Equal contribution.

arXiv:2402.13446v1 [cs.CL] 21 Feb 2024

Abstract

Data annotation is the labeling or tagging of raw data with relevant information, essential for improving the efficacy of machine learning models. The process, however, is labor-intensive and expensive. The emergence of advanced Large Language Models (LLMs), exemplified by GPT-4, presents an unprecedented opportunity to revolutionize and automate the intricate process of data annotation. While existing surveys have extensively covered LLM architecture, training, and general applications, this paper uniquely focuses on their specific utility for data annotation. This survey contributes to three core aspects: LLM-Based Data Annotation, Assessing LLM-Generated Annotations, and Learning with LLM-Generated Annotations. Furthermore, the paper includes an in-depth taxonomy of methodologies employing LLMs for data annotation, a comprehensive review of learning strategies for models incorporating LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation. As a key guide, this survey aims to direct researchers and practitioners in exploring the potential of the latest LLMs for data annotation, fostering future advancements in this critical domain. We provide a comprehensive paper list at https://github.com/Zhen-Tan-dmml/LLM4Annotation.git.

1 Introduction

In the complex realm of machine learning and NLP, data annotation stands out as a critical yet challenging step, transcending simple label attachment to encompass a rich array of auxiliary predictive information. This detailed process typically involves ❶ categorizing raw data with class or task labels for basic classification, ❷ adding intermediate labels for contextual depth (Yu et al., 2022), ❸ assigning confidence scores to gauge annotation reliability (Lin et al., 2022), ❹ applying alignment or preference labels to tailor outputs to specific criteria or user needs, ❺ annotating entity relationships to understand how entities within a dataset interact with each other (Wadhwa et al., 2023), ❻ marking semantic roles to define the underlying roles that entities play in a sentence (Larionov et al., 2019), and ❼ tagging temporal sequences to capture the order of events or actions (Yu et al., 2023).

Data annotation poses significant challenges for current machine learning models because of the complexity, subjectivity, and diversity of data, the domain expertise it demands, and the resource-intensive nature of manually labeling large datasets. Advanced LLMs such as GPT-4 (OpenAI, 2023), Gemini (Team et al., 2023), and Llama-2 (Touvron et al., 2023b) offer a promising opportunity to revolutionize data annotation. LLMs serve as more than just tools: they play a crucial role in improving the effectiveness and precision of data annotation. Their ability to automate annotation tasks (Zhang et al., 2022), ensure consistency across large volumes of data (Hou et al., 2023), and adapt to specific domains through fine-tuning or prompting (Song et al., 2023) significantly reduces the difficulties encountered with traditional annotation methods, setting a new standard for what is achievable in NLP. This survey delves into the nuances of using LLMs for data annotation, exploring methodologies, learning strategies, and associated challenges in this transformative approach. Through this exploration, our goal is to shed light on the motivations behind embracing LLMs as catalysts for redefining the landscape of data annotation in machine learning and NLP.

[Figure: taxonomy of the survey, covering Preliminaries (annotation scenarios: fully supervised, semi-supervised, zero-shot, and few-shot learning) and LLM-Based Data Annotation (manually engineered prompts, alignment via pairwise feedback).]

We navigate the terrain of leveraging the latest breed of LLMs for data annotation. The survey makes four main contributions:

• LLM-Based Data Annotation: We dive into the specific attributes (e.g., language comprehension, contextual understanding), capabilities (e.g., text generation, contextual reasoning), and fine-tuning or prompting strategies (e.g., prompt engineering, domain-specific fine-tuning) of newer LLMs like GPT-4 and Llama-2 that make them uniquely suited for annotation tasks.

• Assessing LLM-Generated Annotations: We ex-
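To make the annotation artifacts enumerated in the introduction (❶–❼) concrete, the sketch below shows one possible shape for a single annotated record, plus a zero-shot prompt asking an LLM to fill the fields a human annotator left empty. This is an illustration only, not a schema from the survey; all field names and the example sentence are hypothetical.

```python
# Hypothetical record carrying the auxiliary annotation artifacts ❶-❼.
record = {
    "text": "Apple acquired Beats in 2014, then launched Apple Music in 2015.",
    "class_label": "business",                                  # ❶ class/task label
    "intermediate_labels": ["acquisition", "product_launch"],   # ❷ contextual depth
    "confidence": 0.92,                                         # ❸ annotation reliability
    "preference": None,                                         # ❹ alignment/preference label
    "entity_relations": [("Apple", "acquired", "Beats")],       # ❺ entity relationships
    "semantic_roles": {"agent": "Apple", "patient": "Beats"},   # ❻ semantic roles
    "temporal_order": ["acquired Beats", "launched Apple Music"],  # ❼ event order
}

def to_prompt(rec):
    """Render a zero-shot prompt asking an LLM to fill the empty fields."""
    missing = [k for k, v in rec.items() if k != "text" and v is None]
    return (f"Annotate the sentence: '{rec['text']}'. "
            f"Fields still to fill: {', '.join(missing)}.")

print(to_prompt(record))
```

In practice each field would be produced or verified by a prompted LLM, a fine-tuned model, or a human reviewer, depending on the annotation scenario.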
Commitment to Fairness. Ensure the development and application of LLMs for data annotation adheres to ethical principles that promote fairness and prevent bias, recognizing the diversity of data and avoiding discriminatory outcomes.

Transparency and Accountability. Maintain transparency in LLM methodologies, training data, and annotation processes. Provide clear documentation and accountability mechanisms to address potential errors or biases introduced by LLMs.

Privacy and Data Protection. Maintain robust data privacy protocols, ensuring confidentiality and consent in training and annotation datasets.

References

Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin Dehghani, Juan Diego Bermeo, Maria Korobeynikova, and Fabrizio Gilardi. 2023. Open-source large language models outperform crowd workers and approach chatgpt in text-annotation tasks. arXiv preprint arXiv:2307.02179.

Hussam Alkaissi and Samy I McFarlane. 2023. Artificial hallucinations in chatgpt: implications in scientific writing. Cureus, 15(2).

Walid Amamou. 2021. Ubiai: Text annotation tool.

Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar. 2023. Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo. GitHub.
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.

Razvan Azamfirei, Sapna R Kudchadkar, and James Fackler. 2023. Large language models and the perils of their hallucinations. Critical Care, 27(1):1–2.

Michiel Bakker, Martin Chadwick, Hannah Sheahan, Michael Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matt Botvinick, et al. 2022. Fine-tuning language models to find agreement among humans with diverse preferences. Advances in Neural Information Processing Systems, 35:38176–38189.

Parikshit Bansal and Amit Sharma. 2023. Large language models as annotators: Enhancing generalization of nlp models at minimal cost. arXiv preprint arXiv:2306.15766.

Ning Bian, Peilin Liu, Xianpei Han, Hongyu Lin, Yaojie Lu, Ben He, and Le Sun. 2023. A drop of ink may make a million think: The spread of false information in large language models. arXiv preprint arXiv:2305.04812.

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2023. A survey on evaluation of large language models.

Canyu Chen and Kai Shu. 2023. Can llm-generated misinformation be detected? arXiv preprint arXiv:2309.13788.

Mingda Chen, Jingfei Du, Ramakanth Pasunuru, Todor Mihaylov, Srini Iyer, Veselin Stoyanov, and Zornitsa Kozareva. 2022. Improving in-context few-shot learning via self-supervised training. In NAACL.

Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, and Kyle Richardson. 2023. Disco: Distilling counterfactuals with large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5514–5528.

Lu Cheng, Kush R Varshney, and Huan Liu. 2021. Socially responsible ai algorithms: Issues, purposes, and challenges. Journal of Artificial Intelligence Research, 71:1137–1181.

Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* chatgpt quality.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. 2023. Why can gpt learn in-context? language models secretly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4005–4019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Danica Dillion, Niket Tandon, Yuling Gu, and Kurt Gray. 2023. Can ai language models replace human participants? Trends in Cognitive Sciences.

Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. 2023. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387.
Avia Efrat and Omer Levy. 2020. The turking test: Can language models understand instructions? arXiv preprint arXiv:2010.11982.

Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. 2023. Gpts are gpts: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130.

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. 2023. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726.

Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. Koala: A dialogue model for academic research. BAIR Blog.

Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.

Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, and Philip Torr. 2023a. A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980.

Yu Gu, Sheng Zhang, Naoto Usuyama, Yonas Woldesenbet, Cliff Wong, Praneeth Sanapathi, Mu Wei, Naveen Valluri, Erika Strandberg, Tristan Naumann, et al. 2023b. Distilling large language models for biomedical knowledge extraction: A case study on adverse drug events. arXiv preprint arXiv:2307.06439.

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023c. Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543.

Arnav Gudibande, Eric Wallace, Charles Burton Snell, Xinyang Geng, Hao Liu, P. Abbeel, Sergey Levine, and Dawn Song. 2023. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717.

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023a. Textbooks are all you need. arXiv preprint arXiv:2306.11644.

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar Teodoro Mendes, Allison Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero C. Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, S. Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuan-Fang Li. 2023b. Textbooks are all you need. arXiv preprint arXiv:2306.11644.

Chase Harrison. 2022. Langchain.

Xingwei He, Zhenghao Lin, Yeyun Gong, Alex Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, Weizhu Chen, et al. 2023. Annollm: Making large language models to be better crowdsourced annotators. arXiv preprint arXiv:2303.16854.

SU Hongjin, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, et al. 2022. Selective annotation makes language models better few-shot learners. In ICLR.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022a. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689.

Or Honovich, Uri Shaham, Samuel R Bowman, and Omer Levy. 2022b. Instruction induction: From few examples to natural language task descriptions. arXiv preprint arXiv:2205.10782.

Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2023. Large language models are zero-shot rankers for recommender systems. arXiv preprint arXiv:2305.08845.

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301.

Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell, and Iz Beltagy. 2023. Large language model distillation doesn't need a teacher. arXiv preprint arXiv:2305.14864.

Bohan Jiang, Zhen Tan, Ayushi Nirmal, and Huan Liu. 2023a. Disinformation detection: An evolving challenge in the age of llms. arXiv preprint arXiv:2309.15847.

Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. 2023b. Lion: Adversarial distillation of closed-source large language model. arXiv preprint arXiv:2305.12870.

Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050.
Jaehyung Kim, Jinwoo Shin, and Dongyeop Kang. 2023. Prefer to classify: Improving text classifiers via auxiliary preference learning. arXiv preprint arXiv:2306.04925.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.

Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. 2023. Pretraining language models with human preferences. In International Conference on Machine Learning, pages 17506–17533. PMLR.

Daniil Larionov, Artem Shelmanov, Elena Chistova, and Ivan Smirnov. 2019. Semantic role labeling with pretrained language models for known and unknown predicates. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 619–628.

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback.

Dawei Li, Zhen Tan, Tianlong Chen, and Huan Liu. 2024. Contextualization distillation from large language model for knowledge graph completion. arXiv preprint arXiv:2402.01729.

Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. 2023. A survey on fairness in large language models. arXiv preprint arXiv:2308.10149.

Q Vera Liao and Jennifer Wortman Vaughan. 2023. Ai transparency in the age of llms: A human-centered research roadmap. arXiv preprint arXiv:2306.01941.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334.

Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. 2023a. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning. arXiv preprint arXiv:2304.08485.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023c. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023d. Trustworthy llms: a survey and guideline for evaluating large language models' alignment. arXiv preprint arXiv:2308.05374.

Renze Lou, Kai Zhang, and Wenpeng Yin. 2023. Is prompt all you need? no. a comprehensive and broader view of instruction learning. arXiv preprint arXiv:2303.10475.

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. 2022. Teaching small language models to reason. arXiv preprint arXiv:2212.08410.

Katerina Margatina, Timo Schick, Nikolaos Aletras, and Jane Dwivedi-Yu. 2023. Active learning principles for in-context learning with large language models. arXiv preprint arXiv:2305.14264.

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. 2022. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147.

Sachit Menon and Carl Vondrick. 2022. Visual classification via description from large language models. arXiv preprint arXiv:2210.07183.

Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2021. A diverse corpus for evaluating and developing english math word problem solvers. arXiv preprint arXiv:2106.15772.

Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. 2021. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys.

Ines Montani and Matthew Honnibal. 2018. Prodigy: A new annotation tool for radically efficient machine teaching. Artificial Intelligence, to appear.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474.

OpenAI. 2023. Gpt-4 technical report.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. 2023. On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618.

A Shelmanov, D Puzyrev, L Kupriyanova, N Khromov, DV Dylov, A Panchenko, D Belyakov, D Larionov, E Artemova, and O Kozlova. 2021. Active learning for sequence tagging with deep pre-trained models and bayesian uncertainty estimates. In EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, pages 1698–1712.

Richard Shin, Christopher Lin, Sam Thomson, Charles Chen Jr, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7699–7715.

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493.

Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, and Sergey Levine. 2022. Offline rl for natural language generation with implicit language q learning. arXiv preprint arXiv:2206.11871.

Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. 2023. Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492.

Taylor Sorensen, Joshua Robinson, Christopher Michael Rytting, Alexander Glenn Shaw, Kyle Jeffrey Rogers, Alexia Pauline Delorey, Mahmoud Khalil, Nancy Fulda, and David Wingate. 2022. An information-theoretic approach to prompt engineering without ground truth labels. arXiv preprint arXiv:2203.11364.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.

Weiwei Sun, Lingyong Yan, Xinyu Ma, Pengjie Ren, Dawei Yin, and Zhaochun Ren. 2023. Is chatgpt good at search? investigating large language models as re-ranking agent. arXiv preprint arXiv:2304.09542.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.

Alex Tamkin, Dat Nguyen, Salil Deshpande, Jesse Mu, and Noah Goodman. 2022. Active learning helps pretrained models learn the intended task. Advances in Neural Information Processing Systems, 35:28140–28153.

Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, et al. 2023a. Gkd: A general knowledge distillation framework for large-scale pre-trained language model. arXiv preprint arXiv:2306.06629.

Zhen Tan, Tianlong Chen, Zhenyu Zhang, and Huan Liu. 2023b. Sparsity-guided holistic explanation for llms with interpretable inference-time intervention. arXiv preprint arXiv:2312.15033.

Zhen Tan, Lu Cheng, Song Wang, Yuan Bo, Jundong Li, and Huan Liu. 2023c. Interpreting pretrained language models via concept bottlenecks. arXiv preprint arXiv:2311.05014.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023a. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023b. Stanford alpaca: An instruction-following llama model.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine, pages 1–11.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. Newsqa: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.

Tamás Vörös, Sean Paul Bergeron, and Konstantin Berlin. 2023. Web content filtering through knowledge distillation of large language models. arXiv preprint arXiv:2305.05027.

Somin Wadhwa, Silvio Amir, and Byron C Wallace. 2023. Revisiting relation extraction in the era of large language models. arXiv preprint arXiv:2305.05003.

Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, and Xiang Ren. 2023a. Scott: Self-consistent chain-of-thought distillation. arXiv preprint arXiv:2305.01879.

Song Wang, Zhen Tan, Ruocheng Guo, and Jundong Li. 2023b. Noise-robust fine-tuning of pretrained language models via external guidance. arXiv preprint arXiv:2311.01108.

Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, et al. 2023c. Knowledge editing for large language models: A survey. arXiv preprint arXiv:2310.16218.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022a. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022b. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022c. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705.

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023d. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Huggingface's transformers: State-of-the-art natural language processing.

Sherry Wu, Hua Shen, Daniel S Weld, Jeffrey Heer, and Marco Tulio Ribeiro. 2023a. Scattershot: Interactive in-context example curation for text transformation. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 353–367.

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023b. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564.
Canwen Xu, Yichong Xu, Shuohang Wang, Yang Liu, Chenguang Zhu, and Julian McAuley. 2023. Small models are valuable plug-ins for large language models. arXiv preprint arXiv:2305.08848.

Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open-source financial large language models. arXiv preprint arXiv:2306.06031.

Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, and Lingpeng Kong. 2022. ZeroGen: Efficient zero-shot learning via dataset generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11653–11669.

Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2022. Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063.

Xinli Yu, Zheng Chen, Yuan Ling, Shujing Dong, Zongyi Liu, and Yanbin Lu. 2023. Temporal data meets LLM – explainable financial time series forecasting. arXiv preprint arXiv:2306.11025.

Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2023. Large language models meet NL2Code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7443–7464.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493.

Mengjie Zhao, Fei Mi, Yasheng Wang, Minglei Li, Xin Jiang, Qun Liu, and Hinrich Schütze. 2021. LMTurk: Few-shot learners as crowdsourcing workers in a language-model-as-a-service framework. arXiv preprint arXiv:2112.07522.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

A LLM-assisted Tools and Software for Annotation

LLM-assisted annotation tools and software are invaluable resources designed specifically to facilitate the annotation process for various NLP tasks. One of their primary attributes is an intuitive, user-friendly interface that allows engineers, and even non-technical annotators, to work easily with complex textual data. These tools support numerous annotation types, from simple binary labels to more intricate hierarchical structures. Their main goal is to simplify the labeling process, enhance label quality, and boost overall productivity in data annotation. Below, we present a selection of libraries and tools that support Large Language Models in the annotation process:

• LangChain: LangChain (Harrison, 2022) is an open-source library¹ that offers an array of tools for constructing LLM-based pipelines and workflows. In particular, it equips large language models with agents so that they can interact effectively with their environment and with various external data sources, producing dynamic, contextually appropriate responses that go beyond a single LLM call. For annotation, its power lies chiefly in a modularized structure called a chain: a complex problem is broken down into smaller sub-tasks, and the results obtained from one or more steps are aggregated and used as input prompts for subsequent steps in the chain.

• Stack AI: Stack AI (Aceituno and Rosinol, 2022) is a paid service that offers an AI-powered data platform, designed explicitly for automating business processes to maximize efficiency. At the core of the platform is the ability to visually design, test, and deploy AI workflows with seamless integration of Large Language Models. Its user-friendly graphical interface (Figure 2) allows users to create apps and workflows for diverse tasks, from content creation and data labeling to conversational AI apps and document processing. Moreover, Stack AI uses weakly supervised machine learning models to expedite the data preparation process.

¹ As of now, available only in JavaScript/TypeScript and Python.
Figure 2: Stack AI dashboard, which provides a visual interface for users to design and track AI workflows.
Note: [1] (Dong et al., 2023); [2] (Ye et al., 2022); [3] (Shin et al., 2021); [4] (Rubin et al., 2022); [5] (Xu et al., 2023); [6] (Dai et al., 2023); [7] (Ziegler et al., 2019); [8] (Bakker et al., 2022); [9] (Menick et al., 2022); [10] (Stiennon et al., 2020).
Table 2: A list of representative LLM-Based Data Annotation papers with open-source code/data.
Paper | Scenario | Technique | Backbone | Venue | Code/Data Link

Evaluation
The Turking Test: Can Language Models Understand Instructions? [1] | Supervised | Human Centric | GPT-2 | Arxiv'20 | Not Available
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor [2] | Unsupervised | Human Centric | T5 | Arxiv'22 | Link
Open-Source Large Language Models Outperform Crowd Workers and Approach ChatGPT in Text-Annotation Tasks [3] | Unsupervised | Human Centric | ChatGPT | Arxiv'23 | Not Available

Data Selection Via Active Learning
Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates [4] | Semi-Supervised | In-Context Learning | BiLSTM, BERT, Distill-BERT, ELECTRA | EACL'21 | Not Available
Active learning helps pretrained models learn the intended task [5] | Semi-Supervised | In-Context Learning | BiT, RoBERTa | Arxiv'22 | Link
Active Learning Principles for In-Context Learning with Large Language Models [6] | Supervised | In-Context Learning | GPT, OPT | EMNLP'23 | Not Available
Large Language Models as Annotators: Enhancing Generalization of NLP Models at Minimal Cost [7] | Semi-Supervised | In-Context Learning | GPT-3.5 turbo | Arxiv'23 | Not Available
ScatterShot: Interactive In-context Example Curation for Text Transformation [8] | Unsupervised | In-Context Learning | GPT-3 | IUI'23 | Link
Prefer to Classify: Improving Text Classifiers via Auxiliary Preference Learning [9] | Supervised | In-Context Learning | GPT-3 | ICML'23 | Link

Note: [1] (Efrat and Levy, 2020); [2] (Honovich et al., 2022a); [3] (Alizadeh et al., 2023); [4] (Shelmanov et al., 2021); [5] (Tamkin et al., 2022); [6] (Margatina et al., 2023); [7] (Bansal and Sharma, 2023); [8] (Wu et al., 2023a); [9] (Kim et al., 2023).

Table 3: A list of representative Assessing LLM-Generated Annotations papers with open-source code/data.
Paper | Scenario | Technique | Backbone | Venue | Code/Data Link

Target Domain Inference
An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels [1] | Unsupervised | Predicting Labels | GPT-2, GPT-3, GPT-Neo, GPT-J | ACL'22 | Link
Emergent Abilities of Large Language Models [2] | Unsupervised | Predicting Labels | GPT-3, PaLM, FLAN, LaMDA, Chinchilla | TMLR'22 | Not Available
Large Language Models are Zero-Shot Reasoners [3] | Unsupervised | Predicting Labels | GPT-3, PaLM, GPT-Neo, GPT-J, OPT | NeurIPS'22 | Link
Language Models as Knowledge Bases? [4] | Unsupervised | Predicting Labels | ELMo, BERT | EMNLP'19 | Link
Causal Reasoning and Large Language Models: Opening a New Frontier for Causality [5] | Unsupervised | Predicting Labels | GPT-3.5, GPT-4 | Arxiv'23 | Not Available
Large Language Models are Zero-Shot Rankers for Recommender Systems [6] | Unsupervised | Predicting Labels | Alpaca, Vicuna, Llama-2, GPT-3.5, GPT-4 | ECIR'24 | Link
Learning Transferable Visual Models From Natural Language Supervision [7] | Unsupervised | Inferring Additional Attributes | Transformer | PMLR'21 | Link
Visual Classification via Description from Large Language Models [8] | Unsupervised | Inferring Additional Attributes | GPT-3 | Arxiv'22 | Not Available

Knowledge Distillation
Teaching Small Language Models to Reason [9] | Unsupervised | Chain-of-Thought | PaLM, GPT-3, T5 | Arxiv'22 | Not Available
Specializing Smaller Language Models towards Multi-Step Reasoning [10] | Unsupervised | Chain-of-Thought | GPT-3.5, T5 | Arxiv'23 | Not Available
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [11] | Unsupervised | Chain-of-Thought | ChatGPT, GPT-4 | EMNLP'23 | Not Available
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [12] | Semi-Supervised | Chain-of-Thought | PaLM, T5 | ACL'23 | Link
GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo [13] | Unsupervised | Input-Output Prompting | GPT-3.5-Turbo, LLaMA, LoRA | GitHub'23 | Link
GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model [14] | Unsupervised, Semi-Supervised, Supervised | Input-Output Prompting | BERT, GLM | ACL'23 | Link
Lion: Adversarial Distillation of Proprietary Large Language Models [15] | Unsupervised | Instruction Tuning, Chain-of-Thought | ChatGPT, GPT-4 | EMNLP'23 | Link
Knowledge Distillation of Large Language Models [16] | Supervised | Instruction Tuning | GPT-2, OPT, LLaMA, GPT-J | Arxiv'23 | Link
Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events [17] | Supervised | Instruction Tuning | GPT-3.5, GPT-4 | Arxiv'23 | Not Available
Web Content Filtering through knowledge distillation of Large Language Models [18] | Supervised | Input-Output Prompting | T5, GPT-3 | Arxiv'23 | Not Available

Note: [1] (Sorensen et al., 2022); [2] (Wei et al., 2022a); [3] (Kojima et al., 2022); [4] (Petroni et al., 2019); [5] (Kıcıman et al., 2023); [6] (Hou et al., 2023); [7] (Radford et al., 2021); [8] (Menon and Vondrick, 2022); [9] (Magister et al., 2022); [10] (Fu et al., 2023); [11] (Sun et al., 2023); [12] (Hsieh et al., 2023); [13] (Anand et al., 2023); [14] (Tan et al., 2023a); [15] (Jiang et al., 2023b); [16] (Gu et al., 2023c); [17] (Gu et al., 2023b); [18] (Vörös et al., 2023).

Table 4: A list of representative Learning with LLM-Generated Annotations papers for Target Domain Inference and Knowledge Distillation with open-source code/data.
Paper | Scenario | Technique | Backbone | Venue | Code/Data Link

Fine-Tuning and Prompting - In-Context Learning
Language Models are Few-Shot Learners [1] | Supervised | In-Context Learning | GPT-3 | NeurIPS'20 | Not Available
Active Learning Principles for In-Context Learning with Large Language Models [2] | Supervised | In-Context Learning | GPT, OPT | EMNLP'23 | Not Available
Selective Annotation Makes Language Models Better Few-Shot Learners [3] | Supervised | In-Context Learning | GPT-J, Codex-davinci-002 | Arxiv'22 | Link
Instruction Induction: From Few Examples to Natural Language Task Descriptions [4] | Unsupervised | In-Context Learning | GPT-3, InstructGPT | Arxiv'22 | Link
Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models [5] | Unsupervised | In-Context Learning | InstructGPT | ICML'23 | Not Available
Improving In-Context Few-Shot Learning via Self-Supervised Training [6] | Supervised | In-Context Learning | RoBERTa | NAACL'22 | Not Available

Fine-Tuning and Prompting - Chain-of-Thought Prompting
A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers [7] | Supervised | Chain-of-Thought | LCA++, UnitDep, GTS | ACL'20 | Link
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [8] | Supervised | Chain-of-Thought | GPT-3, LaMDA, PaLM, UL2 20B, Codex | NeurIPS'22 | Not Available
Large Language Models are Zero-Shot Reasoners [9] | Unsupervised | Chain-of-Thought | Instruct-GPT3, GPT-2, GPT-Neo, GPT-J, T0, OPT | NeurIPS'22 | Not Available
Automatic chain of thought prompting in large language models [10] | Supervised, Unsupervised | Chain-of-Thought | GPT-3, Codex | ICLR'23 | Link
Rationale-augmented ensembles in language models [11] | Semi-Supervised | Chain-of-Thought | PaLM, GPT-3 | Arxiv'22 | Not Available
Specializing Smaller Language Models towards Multi-Step Reasoning [12] | Unsupervised | Chain-of-Thought | GPT-3.5, T5 | Arxiv'23 | Not Available
SCOTT: Self-Consistent Chain-of-Thought Distillation [13] | Supervised | Chain-of-Thought | GPT-neox, T5 | Arxiv'22 | Not Available

Note: [1] (Brown et al., 2020); [2] (Margatina et al., 2023); [3] (Hongjin et al., 2022); [4] (Honovich et al., 2022b); [5] (Shao et al., 2023); [6] (Chen et al., 2022); [7] (Miao et al., 2021); [8] (Wei et al., 2022b); [9] (Kojima et al., 2022); [10] (Zhang et al., 2022); [11] (Wang et al., 2022a); [12] (Fu et al., 2023); [13] (Wang et al., 2023a).

Table 5: A list of representative Learning with LLM-Generated Annotations papers for Fine-Tuning and Prompting (In-Context Learning and Chain-of-Thought) with open-source code/data.
Paper | Scenario | Technique | Backbone | Venue | Code/Data Link

Fine-Tuning and Prompting - Instruction Tuning

Note: [1] (Chung et al., 2022); [2] (Muennighoff et al., 2022); [3] (Wang et al., 2022b); [4] (Brown et al., 2020); [5] (Touvron et al., 2023a); [6] (Chiang and Lee, 2023); [7] (Wang et al., 2022c); [8] (Ziegler et al., 2019); [9] (Keskar et al., 2019); [10] (Liu et al., 2023a); [11] (Korbak et al., 2023); [12] (Ouyang et al., 2022); [13] (Touvron et al., 2023b); [14] (Snell et al., 2022); [15] (Menick et al., 2022); [16] (Lee et al., 2023).

Table 6: A list of representative Learning with LLM-Generated Annotations papers for Fine-Tuning and Prompting (Instruction Tuning and Alignment Tuning) with open-source code/data.